Semi-Automated Text Classification (SATC) may be defined as the task of ranking a set V of automatically
labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects 
where appropriate) the documents in a top-ranked portion of V with the goal of increasing the overall 
labelling accuracy of V, the expected increase is maximized. An obvious SATC strategy is to rank V so 
that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this work 
we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the 
notion of validation gain, defined as the improvement in classification effectiveness that would derive from
validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented
ranking methods, based on the expected reduction in classification error brought about by partially
validating a list generated by a given ranking method. We report the results of experiments showing that, 
with respect to the baseline method above, and according to the proposed measure, our utility-theoretic 
ranking methods can achieve substantially higher expected reductions in classification error. 

Categories and Subject Descriptors: Information systems [Information retrieval]: Retrieval tasks 
and goals— Clustering and Classification ; Computing methodologies [Machine learning]: Learning 
paradigms— Supervised learning 

General Terms: Algorithm, Design, Experimentation, Measurements 

Additional Key Words and Phrases: Text classification, supervised learning, semi-automated text
classification, cost-sensitive learning, ranking


1. INTRODUCTION 


Suppose an organization needs to classify a set V of textual documents under classification
scheme C, and suppose that V is too large to be classified manually, so that resorting to
some form of automated text classification (TC) is the only viable option. Suppose also that
the organization has strict accuracy standards, so that the level of effectiveness obtainable
via state-of-the-art TC technology (including any possible improvements obtained via active
learning) is not sufficient. In this case, the most plausible strategy is to train an
automatic classifier $\hat{\Phi}$ on the available training data Tr, improve it as much as
possible (e.g., via active learning), classify V by means of $\hat{\Phi}$, and then have a
human annotator validate (i.e., inspect and correct where appropriate) the results of the
automatic classification. The human annotator will validate only a subset $V' \subset V$,
e.g., until she is confident that the overall level of accuracy of V is sufficient, or until
she runs out of time. We call this scenario semi-automated text classification (SATC).


An automatic TC system may support this task by ranking, after the classification 
phase has ended and before validation begins, the classified documents in such a way 
that, if the human annotator validates the documents starting from the top of the 
ranking, the expected increase in classification effectiveness that derives from this 
validation is maximized. This paper is concerned with devising good ranking strategies 
for this task. 

One obvious strategy (also used in [Martinez-Alvarez et al. 2012]) is to rank the documents
in ascending order of the confidence scores generated by $\hat{\Phi}$, so that the top-ranked
documents are the ones that $\hat{\Phi}$ has classified with the lowest confidence. The
rationale is that an increase in effectiveness can derive only from validating misclassified
documents, and that a good ranking method is simply one that top-ranks the documents with the
highest probability of misclassification, which (in the absence of other information) we may
take to be the documents that $\hat{\Phi}$ has classified with the lowest confidence.
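
The confidence-based baseline just described can be sketched as follows. This is only a minimal
illustration, not the implementation used in the experiments: the document identifiers and the
confidence values are made up, and `rank_by_low_confidence` is a hypothetical helper name.

```python
# Hypothetical confidence scores (higher = more confident); the values are made up.
confidences = {"d1": 2.3, "d2": 0.1, "d3": 0.7, "d4": 1.9}


def rank_by_low_confidence(confidences):
    """Return document ids in ascending order of classifier confidence."""
    return sorted(confidences, key=confidences.get)


baseline_ranking = rank_by_low_confidence(confidences)
print(baseline_ranking)  # ['d2', 'd3', 'd4', 'd1']
```

The annotator would then validate documents from the start of `baseline_ranking`.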

In this work we show that this strategy is, in general, suboptimal. Simply stated, the reason
is that the improvements in effectiveness that derive from correcting a false positive or a
false negative, respectively, may not be the same, depending on which evaluation function we
take to represent our notion of “effectiveness”. Additionally, the ratio between these
improvements may vary during the validation process. In other words, an optimal ranking
strategy must take into account the above improvements and how these impact on the evaluation
function; we will thus look at ranking methods based on explicit loss minimization, i.e.,
optimized for the specific effectiveness measures used.

The contributions of this paper are the following. First, we develop new utility-theoretic
ranking methods for SATC based on the notion of validation gain, defined as the improvement
in effectiveness that would derive from correcting a given type of mistake (i.e., false
positive or false negative). Second, we propose a new evaluation measure for SATC based on a
probabilistic user model, and use it to evaluate our experiments on standard text
classification datasets. The results of these experiments show that, with respect to the
confidence-based baseline method discussed above, our ranking methods are substantially more
effective.

The rest of the paper is organized as follows. Section 2 reviews related work, while
Section 3 sets the stage by introducing preliminary definitions and notation. Section 4
describes our base utility-theoretic strategy for ranking the automatically labelled
documents, while in Section 5 we propose a novel effectiveness measure for this task based on
a probabilistic user model. Section 6 reports the results of our experiments, in which we
test the effectiveness of ranking strategies by simulating the work of a human annotator who
validates variable-sized portions of the labelled test set. In Section 7 we address a
potential problem deriving from the “static” nature of our strategy, by describing a
“dynamic” (albeit computationally more expensive) version of the same strategy, and draw an
experimental comparison between the two. In Section 8 we acknowledge the existence of two
different ways (“micro” and “macro”) of averaging effectiveness results across classes, and
show that the methods we have developed so far are optimized for macro-averaging; we thus
develop and test methods optimized for micro-averaged effectiveness. Section 9 concludes by
charting avenues for future research.


2. RELATED WORK 

Many researchers have tackled the problem of how to improve on the accuracy delivered by an
automatic text classifier when this accuracy is not up to the standards required by the
application (as, e.g., stipulated in a Service Level Agreement).

A standard response to this problem is to ask human annotators to label additional data that
can then be used in retraining a (hopefully) more accurate classifier. This can be done via
the use of active learning techniques (AL; see e.g., [Hoi et al. 2006; Tong and Koller
2001]), i.e., via algorithms that rank unlabelled documents in such a way that the top-ranked
ones bring about, once manually labelled and used for retraining, the highest expected
improvement in classification accuracy. Still, the improvement in accuracy that can be
obtained via active learning is limited: even by using the best active learning algorithm,
accuracy tends to plateau after a certain number of unlabelled documents have been manually
annotated. When this plateau is reached, annotating more documents will not improve accuracy
any further [Settles 2012]. Similar considerations apply when active learning is carried out
at the term level, rather than at the document level [Godbole et al. 2004; Raghavan et al.
2006].

A related response to the same problem is to use training data cleaning techniques (TDC; see
e.g., [Brodley and Friedl 1999; Esuli and Sebastiani 2013; Fukumoto and Suzuki 2004]), i.e.,
use algorithms that optimize the human annotators' efforts at correcting possible labelling
mistakes in the training set. TDC algorithms rank the training documents in such a way that
the top-ranked ones bring about, once their labels are manually checked and then used for
retraining, the highest expected improvement in classification accuracy. In other words, TDC
is to labelled training documents what AL is to unlabelled ones. Similarly to what happens in
active learning, in many applicative contexts high enough accuracy levels cannot be attained
even at the price of carefully validating the entire training set for labelling mistakes.

Yet another response may be the use of some form of weakly supervised learning /
semi-supervised learning, i.e., of techniques that allow training a classifier when training
data are few, often leveraging unlabelled data along with the labelled training data
[Chapelle et al. 2006; Zhu and Goldberg 2009]. This solution relies on the fact that
unlabelled data is often available in large quantities, sometimes even from the same source
where the training and test data originate. Similarly to the cases of AL and TDC,
improvements with respect to the results of the purely supervised setting may be obtained,
but these improvements are bound to be limited.

In conclusion, when the required accuracy standards are high, neither training data cleaning,
nor active learning, nor weakly supervised / semi-supervised learning, nor a combination of
them, may suffice to reach these standards. In this case, after any or all of these
techniques have been applied, we can only resort to manual validation of part of the
automatically classified documents by a human annotator. Supporting this last phase is the
goal of semi-automated text classification.

All the techniques discussed above are different from SATC, since in SATC we are not
concerned with improving the quality of the trained classifier. We are instead concerned with
improving the quality of the automatically classified test set, typically after all attempts
at injecting additional quality in the automatic classifier (and in the training set) have
proved insufficient; in particular, no retraining / reclassification phase is involved in
SATC.

Active learning. As remarked above, SATC certainly bears relations to active learning. In
both SATC and the selective sampling approach to AL ([Lewis and Catlett 1994]; also known as
the pool-based approach [McCallum and Nigam 1998]), the automatically classified objects are
ranked and the human annotator is encouraged to correct possible misclassifications by
working down from the top of the ranked list. However, as remarked above, the goals of the
two tasks are different. In active learning we are interested in top-ranking the unlabelled
documents that, once manually labelled, would maximize the information fed back to the
learning process, while in SATC we are interested in top-ranking the unlabelled documents
that, once manually validated, maximize the expected accuracy of the automatically classified
document set. As a result, the optimal ranking strategies for the two tasks may be different
too.

Some approaches to AL take into account the costs of misclassification, thus attributing
different levels of importance to different types of error. In [Kapoor et al. 2007] these
costs are embedded into a decision-theoretic framework, which is reminiscent of our
utility-theoretic framework. A value-of-information criterion is used in order to select
samples which maximize profit, determined by the total risk of classification and the total
cost of labelling. The total risk is formulated as a utility function in which the
probability of each classification and the risk associated with it are taken into account.
The concept of risk is reminiscent of the notion of “gain” defined in our utility function
(see Section 4.2), but its purpose is to consider the human effort needed in correcting a
misclassified sample [Vijayanarasimhan and Grauman 2009]. Therefore this decision-theoretic
strategy is aimed not at directly improving classification accuracy, but at minimising the
manual work of the annotator, which is quantified by the risk and the cost of labelling.

Semi-automated TC. While AL (and, to a much lesser degree, TDC) have been investigated
extensively in a TC context, semi-automated TC has been fairly neglected by the research
community. While a number of papers (e.g., [Larkey and Croft 1996; Sebastiani 2002; Yang and
Liu 1999]) have evoked the existence of this scenario, we are not aware of many published
papers that either discuss ranking policies for supporting the human annotator’s effort, or
that attempt to quantify the effort needed for reaching a desired level of accuracy. For
instance, while discussing a system for the automatic assignment of ICD9 classes to patients’
discharge summaries, Larkey and Croft [1996] say “We envision these classifiers being used in
an interactive system which would display the 20 or so top ranking [classes] and their scores
to an expert user. The user could choose among these candidates (...)”, but do not present
experiments that quantify the accuracy that the validation activity brings about, or methods
aimed at optimizing the cost-effectiveness of this activity.

The recent work of [Martinez-Alvarez et al. 2012] tackles the related problem of deciding
when a document is too difficult for automated classification, and should thus be routed to a
human annotator. However, the method presented in the paper is not applicable to our case,
since (a) it is undefined for documents with no predicted labels (a fairly frequent case in
multi-label TC), and (b) it is undefined when the classification threshold is zero (again, a
fairly frequent case in modern learning algorithms).

In a subsequent paper [Martinez-Alvarez et al. 2013], the same authors study a family of SATC
methods that exploit “document difficulty”, taking into account the confidence scores
computed by the base classifiers. They also present a comparison between the techniques they
propose and that presented in an earlier version of the present paper [Berardi et al. 2012];
in this comparison, the former are claimed to outperform the latter on the Reuters-21578
dataset discussed in Section 6.4. However, this comparison is incorrect, since the authors
compare the results of their ranking methods as applied to confidence scores generated by
SVMs with those of the [Berardi et al. 2012] ranking method as applied to confidence scores
generated by a different learner. A correct comparison among ranking methods must instead be
carried out by providing to all methods the same input, i.e., the same confidence scores
(whose generation is not part of the method itself). The comparison reported in
[Martinez-Alvarez et al. 2013] is incorrect also because it is carried out in terms of the
$ENER^{\mu}_{\rho}$ measure (see Section 5.3); instead, as stated in [Berardi et al. 2012],
the measure according to which the method of [Berardi et al. 2012] should be evaluated is
$ENER^{M}_{\rho}$, and not $ENER^{\mu}_{\rho}$, since it is $ENER^{M}_{\rho}$ that that
method was optimized for. In Section 8 we will indeed present SATC methods optimized for
$ENER^{\mu}_{\rho}$.

An application of the method discussed in Section 7 to performing SATC in a market research
context is presented in [Berardi et al. 2014].


3. PRELIMINARIES 

Given a set of textual documents V and a predefined set of classes
$C = \{c_1, \ldots, c_m\}$, (multi-class multi-label) TC is usually defined as the task of
estimating an unknown target function $\Phi : V \times C \to \{-1, +1\}$, that describes how
documents ought to be classified, by means of a function
$\hat{\Phi} : V \times C \to \{-1, +1\}$ called the classifier (consistently with most
mathematical literature, we use the caret symbol to indicate estimation); +1 and -1 represent
membership and non-membership of the document in the class. Here, “multi-class” means that
there are $m \geq 2$ classes, while “multi-label” refers to the fact that each document may
belong to zero, one, or several classes at the same time. Multi-class multi-label TC is
usually accomplished by generating m independent binary classifiers $\hat{\Phi}_j$, one for
each $c_j \in C$, each entrusted with deciding whether a document belongs or not to class
$c_j$. In this paper we will actually restrict our attention to classifiers $\hat{\Phi}_j$
that, aside from taking a binary decision $D_{ij} \in \{-1, +1\}$ on a given document $d_i$,
also return a confidence estimate $C_{ij}$, i.e., a numerical value representing the strength
of their belief in the fact that $D_{ij}$ is correct (the higher the value, the higher the
confidence). We formalize this by taking a binary classifier to be a function
$\hat{\Phi}_j : V \to \mathbb{R}$ in which the sign of the returned value
$D_{ij} = \mathrm{sgn}(\hat{\Phi}_j(d_i)) \in \{-1, +1\}$ indicates the binary decision of
the classifier, and the absolute value $C_{ij} = |\hat{\Phi}_j(d_i)|$ represents its
confidence in the decision.
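
The sign/magnitude convention above can be illustrated with a short sketch. The function name
`decision_and_confidence` is hypothetical, and the treatment of a score of exactly zero is an
assumption, since the text does not say how sgn(0) is resolved.

```python
def decision_and_confidence(score):
    """Split a real-valued output of a binary classifier into (D_ij, C_ij)."""
    # Assumption: a score of exactly 0 is mapped to the decision -1; the
    # text does not specify how sgn(0) is handled.
    decision = 1 if score > 0 else -1
    confidence = abs(score)
    return decision, confidence


print(decision_and_confidence(-0.8))  # (-1, 0.8)
print(decision_and_confidence(2.5))   # (1, 2.5)
```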

For the time being we also assume that

$$F_1(\hat{\Phi}_j(Te)) = \frac{2TP_j}{2TP_j + FP_j + FN_j} \tag{1}$$

(the well-known harmonic mean of precision and recall) is the chosen evaluation measure for
binary classification, where $\hat{\Phi}_j(Te)$ indicates the result of applying
$\hat{\Phi}_j$ to the test set Te and $TP_j$, $FP_j$, $FN_j$, $TN_j$ indicate the numbers of
true positives, false positives, false negatives, and true negatives in Te for class $c_j$.
Note that $F_1$ is undefined when $TP_j = FP_j = FN_j = 0$; in this case we take
$F_1(\hat{\Phi}_j(Te)) = 1$, since $\hat{\Phi}_j$ has correctly classified all documents as
negative examples. The assumption that $F_1$ is our evaluation measure is not restrictive; as
will be evident later on in the paper, our methods can be customized to any evaluation
function that can be computed from a contingency table.

As a measure of effectiveness for multi-class multi-label TC, for the time being we use
macro-averaged $F_1$ (noted $F_1^M$), which is obtained by computing the class-specific $F_1$
values and averaging them across all the $c_j \in C$. An alternative way of averaging across
the classes (micro-averaged $F_1$) will be discussed in Section 8.
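
As a small illustration of Equation 1 and of macro-averaging, the sketch below computes $F_1$
from contingency counts, including the convention adopted for the all-zero case. The function
names and the example counts are made up for illustration only.

```python
def f1(tp, fp, fn):
    """Equation 1, with the convention F1 = 1 when TP = FP = FN = 0."""
    if tp == 0 and fp == 0 and fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(tables):
    """Average the class-specific F1 values over a list of (TP, FP, FN) triples."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in tables) / len(tables)


# Made-up contingency counts for two classes; the second class has no
# positives and none were predicted, so its F1 is taken to be 1.
tables = [(10, 5, 5), (0, 0, 0)]
print(f1(10, 5, 5))      # 0.666...
print(macro_f1(tables))  # 0.833...
```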

In this paper the set of unlabelled documents that the classifier must automatically 
label (and rank) in the “operational” phase will be represented by the test set Te. 


4. A RANKING METHOD FOR SATC BASED ON UTILITY THEORY 
4.1. Ranking by utility 

For the time being let us concentrate on the binary case, i.e., let us assume there is a
single class $c_j$ that needs to be separated from its complement $\bar{c}_j$. The policy we
propose for ranking the automatically labelled documents in $\hat{\Phi}_j(Te)$ makes use of
utility theory, an extension of probability theory that incorporates the notion of gain (or
loss) that derives from a given course of action [Anand 1993; von Neumann and Morgenstern
1944]. Utility theory is a general theory of rational action under uncertainty, and as such
is used in many fields of human activity; for instance, one such field is betting, since in
placing a certain bet we take into account (a) the probabilities of occurrence that we
subjectively attribute to a set of outcomes (say, to the possible outcomes of a given
football game), and (b) the gains or losses that we obtain, having bet on one of them, if the
various outcomes materialise.

In order to explain our method let us introduce some basics of utility theory. Given a set
$A = \{\alpha_1, \alpha_2, \ldots\}$ of possible courses of action and a set
$\Omega = \{\omega_1, \omega_2, \ldots\}$ of mutually disjoint events, the expected utility
$U(\alpha_i, \Omega)$ that derives from choosing course of action $\alpha_i$, given that any
of the events in $\Omega$ may occur, is defined as

$$U(\alpha_i, \Omega) = \sum_{\omega_k \in \Omega} P(\omega_k)\, G(\alpha_i, \omega_k) \tag{2}$$

where $P(\omega_k)$ is the probability of occurrence of event $\omega_k$ and
$G(\alpha_i, \omega_k)$ is the gain obtained if $\alpha_i$ is chosen and event $\omega_k$
occurs. For instance, $\alpha_i$ may be the course of action “betting on Arsenal FC’s win”
and $\Omega$ may be the set of mutually disjoint events
$\Omega = \{\omega_1, \omega_2, \omega_3\}$, where $\omega_1$ = “Arsenal FC wins”,
$\omega_2$ = “Arsenal FC and Chelsea FC tie”, and $\omega_3$ = “Chelsea FC wins”; in this
case,

— $P(\omega_1)$, $P(\omega_2)$, $P(\omega_3)$ are the probabilities of occurrence that we
subjectively attribute to the three events $\omega_1$, $\omega_2$, $\omega_3$;

— $G(\alpha_i, \omega_1)$, $G(\alpha_i, \omega_2)$, $G(\alpha_i, \omega_3)$ are the economic
rewards we obtain if we choose course of action $\alpha_i$ (i.e., we bet on the win of
Arsenal FC) and the respective event occurs. Of course, this economic reward will be positive
if $\omega_1$ occurs and negative if either $\omega_2$ or $\omega_3$ occurs.

When we face alternative courses of action, acting rationally means choosing the course of
action that maximises our expected utility. For instance, given the alternative courses of
action $\alpha_1$ = “betting on Arsenal FC’s win”, $\alpha_2$ = “betting on Arsenal FC’s and
Chelsea FC’s tie”, $\alpha_3$ = “betting on Chelsea FC’s win”, we should pick among
$\{\alpha_1, \alpha_2, \alpha_3\}$ the course of action that maximises
$U(\alpha_i, \Omega)$.
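
The betting example can be made concrete with a short sketch of Equation 2. All probabilities
and payoffs below are invented purely for illustration, and the function and variable names
are hypothetical.

```python
def expected_utility(probs, gains):
    """Equation 2: sum over the events of P(event) * G(action, event)."""
    return sum(probs[event] * gains[event] for event in probs)


# Made-up subjective probabilities for the three mutually disjoint outcomes.
probs = {"arsenal_wins": 0.5, "tie": 0.3, "chelsea_wins": 0.2}

# Made-up gains for a stake of 1 unit: each bet pays off only if "its" event occurs.
gains = {
    "bet_arsenal": {"arsenal_wins": 1.0, "tie": -1.0, "chelsea_wins": -1.0},
    "bet_tie": {"arsenal_wins": -1.0, "tie": 2.5, "chelsea_wins": -1.0},
    "bet_chelsea": {"arsenal_wins": -1.0, "tie": -1.0, "chelsea_wins": 4.0},
}

utilities = {action: expected_utility(probs, g) for action, g in gains.items()}
best_action = max(utilities, key=utilities.get)
print(best_action)  # bet_tie
```

With these numbers the most probable outcome is not the best bet, because the gains differ
across events; the same effect is what separates utility-based ranking from purely
probability-based ranking.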

How does this translate into a method for ranking automatically labelled documents? Assume we
have a set $D = \{d_1, \ldots, d_n\}$ of such documents that we want to rank, and that $c_j$
is the class we deal with. For instantiating Equation 2 concretely we need

(1) to decide what our set $A = \{\alpha_1, \alpha_2, \ldots\}$ of alternative courses of action is;

(2) to decide what the set $\Omega = \{\omega_1, \omega_2, \ldots\}$ of mutually disjoint events is;

(3) to define the gains $G(\alpha_i, \omega_k)$;

(4) to specify how we compute the probabilities of occurrence $P(\omega_k)$.

Let us discuss each of these steps in turn.

Concerning Step 1, we will take the action of validating document $d_i$ as course of action
$\alpha_i$. In this way we will evaluate the expected utility $U_j(d_i, \Omega)$, i.e., the
expected increase in the classification accuracy of Te for class $c_j$ that derives from
validating document $d_i$, and we will be able to rank the documents by their
$U_j(d_i, \Omega)$ value, so as to top-rank the ones with the highest expected utility.

Concerning Step 2, we have argued in the introduction that the increase in accuracy that
derives from validating a document depends on whether the document is a true positive, a
false positive, a false negative, or a true negative; as a consequence, we will take
$\Omega = \{tp_j, fp_j, fn_j, tn_j\}$, where each of these events implicitly refers to the
document $d_i$ under scrutiny (e.g., $tp_j$ denotes the event “document $d_i$ is a true
positive for class $c_j$”). Our utility function has thus the form

$$U_j(d_i, \Omega) = \sum_{\omega_k \in \{tp_j, fp_j, fn_j, tn_j\}} P(\omega_k)\, G(d_i, \omega_k) \tag{3}$$


How to address Step 3 (defining the gains) will be the subject of Sections 4.2 and 4.3, while
Step 4 (computing the probabilities of occurrence) will be discussed in Section 4.4.

4.2. Validation gains

We equate $G(d_i, fp_j)$ in Equation 3 with the average increase in $F_1(\hat{\Phi}_j(Te))$
that would derive from manually validating the label attributed by $\hat{\Phi}_j$ to a
document $d_i$ in $FP_j$. We call this the validation gain of a document in $FP_j$. Note that
validation gains are independent of a particular document, i.e.,
$G(d', fp_j) = G(d'', fp_j)$ for all $d', d'' \in Te$. Analogous arguments apply to
$G(d_i, tp_j)$, $G(d_i, fn_j)$, and $G(d_i, tn_j)$.

Quite evidently, $G(d_i, tp_j) = G(d_i, tn_j) = 0$, since when the human annotator validates
the label attributed to $d_i$ by $\hat{\Phi}_j$ and finds out it is correct, she will not
modify it, and the value of $F_1(\hat{\Phi}_j(Te))$ will thus remain unchanged.

Concerning misclassified documents, it is easy to see that, in general,
$G(d_i, fp_j) \neq G(d_i, fn_j)$. In fact, if a false positive is corrected, the increase in
$F_1$ is the one deriving from removing a false positive and adding a true negative, i.e.,

$$G(d_i, fp_j) = \frac{1}{FP_j}\left(F_1^{FP}(\hat{\Phi}_j(Te)) - F_1(\hat{\Phi}_j(Te))\right)
= \frac{1}{FP_j}\left(\frac{2TP_j}{2TP_j + FN_j} - \frac{2TP_j}{2TP_j + FP_j + FN_j}\right) \tag{4}$$

where by $F_1^{FP}(\hat{\Phi}_j)$ we indicate the value of $F_1$ that would derive by
correcting all false positives of $\hat{\Phi}_j(Te)$, i.e., turning all of them into true
negatives. Conversely, if a false negative is corrected, the increase in $F_1$ is the one
deriving from removing a false negative and adding a true positive, i.e.,

$$G(d_i, fn_j) = \frac{1}{FN_j}\left(F_1^{FN}(\hat{\Phi}_j(Te)) - F_1(\hat{\Phi}_j(Te))\right)
= \frac{1}{FN_j}\left(\frac{2(TP_j + FN_j)}{2(TP_j + FN_j) + FP_j} - \frac{2TP_j}{2TP_j + FP_j + FN_j}\right) \tag{5}$$

where by $F_1^{FN}(\hat{\Phi}_j)$ we indicate the value of $F_1$ that would derive by turning
all the false negatives of $\hat{\Phi}_j(Te)$ into true positives.

Equation 4 defines the gain deriving from the correction of a false positive as the average
across the gains deriving from the correction of each false positive in the contingency table
(and analogously for Equation 5). The advantage of such a definition is that this average
gain can be computed once and for all for the entire process. We will see a different
definition, leading to a different SATC method, in Section 7.
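
The two validation gains can be computed directly from the contingency counts. The sketch
below is only an illustration of Equations 4 and 5 under made-up counts; the function names
are hypothetical, and both gain functions assume that the relevant count (FP or FN) is
nonzero.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts, taken to be 1 when TP = FP = FN = 0."""
    if tp == 0 and fp == 0 and fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)


def gain_fp(tp, fp, fn):
    """Equation 4: average F1 increase per corrected false positive (needs fp > 0)."""
    return (f1(tp, 0, fn) - f1(tp, fp, fn)) / fp


def gain_fn(tp, fp, fn):
    """Equation 5: average F1 increase per corrected false negative (needs fn > 0)."""
    return (f1(tp + fn, fp, 0) - f1(tp, fp, fn)) / fn


# Made-up counts: 50 true positives, 10 false positives, 40 false negatives.
print(gain_fp(50, 10, 40))  # about 0.00476
print(gain_fn(50, 10, 40))  # about 0.00702
```

For these counts, correcting a false negative is worth more than correcting a false positive,
which is the asymmetry that the confidence-based baseline ignores.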

4.3. Smoothing contingency cell estimates 

One problem that needs to be tackled in order to compute $G(d_i, fp_j)$ and $G(d_i, fn_j)$ is
that the contingency cell counts $TP_j$, $FP_j$, $FN_j$ are not known (since in operational
settings we do not know which test documents have been classified correctly and which have
been instead misclassified), and thus need to be estimated.² In order to estimate them we
make the assumption that the training set and the test set are independent and identically
distributed. We then perform a k-fold cross-validation (k-FCV) on the training set: if by
$TP_j^{Tr}$ we denote the number of true positives for class $c_j$ resulting from the k-fold
cross-validation on Tr, the maximum-likelihood estimate of $TP_j$ is
$\widehat{TP}_j^{ML} = TP_j^{Tr} \cdot |Te|/|Tr|$; the same holds for $\widehat{FP}_j^{ML}$
and $\widehat{FN}_j^{ML}$.³
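
The rescaling of cross-validation counts to the size of the test set can be sketched as
follows. The function name and the example counts are hypothetical, and the cross-validation
itself is assumed to have been run already.

```python
def ml_cell_estimates(tp_tr, fp_tr, fn_tr, n_train, n_test):
    """Rescale k-fold cross-validation counts on Tr by |Te| / |Tr|."""
    scale = n_test / n_train
    return tp_tr * scale, fp_tr * scale, fn_tr * scale


# Made-up cross-validation counts on a training set of 1000 documents,
# rescaled to a test set of 250 documents.
print(ml_cell_estimates(120, 30, 50, n_train=1000, n_test=250))  # (30.0, 7.5, 12.5)
```

Note that the estimates need not be integers, a point the text returns to when discussing
smoothing.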


2 We will disregard the estimation of $TN_j$ since it is unnecessary for our purposes, given
that $F_1(\hat{\Phi}_j(Te))$ does not depend on $TN_j$.

3 As in many other contexts, the assumption that the training set and the test set are
independent and identically distributed may not be verified in practice; if it is not, in our
case this leads to imprecise estimates of
However, these maximum-likelihood cell count estimates need to be smoothed, so as to avoid
zero counts. In fact, if $\widehat{TP}_j^{ML} = 0$ it would derive from Equation 4 that there
is nothing to be gained by correcting a false positive, which is counterintuitive. Similarly,
if $\widehat{FP}_j^{ML} = 0$ the very notion of $F_1^{FP}(\hat{\Phi}_j)$ would be
meaningless, since it does not make sense to speak of “removing a false positive” when there
are no false positives; and the same goes for $\widehat{FN}_j^{ML}$.

A second reason why $\widehat{TP}_j^{ML}$, $\widehat{FP}_j^{ML}$, $\widehat{FN}_j^{ML}$ need
to be smoothed is that, when $|Te|/|Tr| < 1$, they may give rise to negative values for
$G(d_i, fp_j)$ and $G(d_i, fn_j)$, which is counterintuitive. To see this, note that
$\widehat{TP}_j^{ML}$, $\widehat{FP}_j^{ML}$, $\widehat{FN}_j^{ML}$ may not be integers
(which is not bad per se, since the notions of precision, recall, and their harmonic mean
intuitively make sense also when we allow the contingency cell counts to be nonnegative reals
instead of the usual integers), and may be smaller than 1 (this can happen when
$|Te|/|Tr| < 1$). This latter fact is problematic, both in theory (since it is meaningless to
speak of, say, removing a false positive from Te when “there is less than 1 false positive in
Te”) and in practice (since it is easy to verify that negative values for $G(d_i, fp_j)$ and
$G(d_i, fn_j)$ may derive).

Smoothing has been studied extensively in language modelling for speech processing [Chen and
Goodman 1996] and for ad hoc search in IR [Zhai and Lafferty 2004]. However, the present
context is slightly different, in that we need to smooth contingency tables, and not (as in
the cases above) language models. In particular, while the $\widehat{TP}_j^{ML}$,
$\widehat{FP}_j^{ML}$, and $\widehat{FN}_j^{ML}$ are the obvious counterparts of the document
model resulting from maximum-likelihood estimation, there is no obvious counterpart to the
“collection model”, thus making the use of, e.g., Jelinek-Mercer smoothing problematic. A
further difference is that we here require the smoothed counts not only to be nonzero, but
also to be $\geq 1$ (a requirement not to be found in language modelling).

Smoothing has also been studied specifically for the purpose of smoothing contingency cell
estimates [Burman 1987; Simonoff 1983]. However, these methods are inapplicable to our case,
since they were originally conceived for contingency tables characterized by a small (i.e.,
< 1) ratio between the number of observations (which in our case is |Te|) and the number of
cells (which in our case is 4); our case is quite the opposite. Additionally, these smoothing
methods do not operate under the constraint that the smoothed counts should all be
$\geq 1$, which is a hard constraint for us.

For all these reasons, rather than adopting more sophisticated forms of smoothing, we adopt
simple additive smoothing (also known as Laplace smoothing), a special case of Bayesian
smoothing using Dirichlet priors [Zhai and Lafferty 2004] which is obtained by adding a fixed
quantity to each of $\widehat{TP}_j^{ML}$, $\widehat{FP}_j^{ML}$, $\widehat{FN}_j^{ML}$. As a
fixed quantity we add 1, since it is the quantity that all our cell counts need to be greater
than or equal to for Equations 4 and 5 to make sense. We denote the resulting estimates by
$\widehat{TP}_j^{La}$, $\widehat{FP}_j^{La}$, $\widehat{FN}_j^{La}$. As will be clear in
Section 6 and following, this simple form of smoothing proves almost optimal, which seems to
indicate that there is not much to be gained by applying more sophisticated smoothing methods
to our problem context.


the contingency cell counts. While this may be suboptimal, there is practically nothing that
we can do about it, since we do not know the real values of these counts; in other words,
k-FCV is our “best possible shot” at estimating them in the absence of foreknowledge. As
discussed in Section 6.5, we will exactly measure how suboptimal using k-FCV is, by running
experiments in which an oracle feeds our utility-theoretic method with the true values of the
contingency cells.
Note that we apply smoothing in an “on-demand” fashion, i.e., we check if the contingency
table needs smoothing at all (i.e., if any of $\widehat{TP}_j^{ML}$, $\widehat{FP}_j^{ML}$,
$\widehat{FN}_j^{ML}$ is < 1) and we smooth it only if this is the case. The reason why we
adopt this “on-demand” policy will be especially apparent in Section 7.
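
The on-demand additive smoothing described above can be sketched in a few lines. The function
name is hypothetical and the example estimates are made up.

```python
def smooth_on_demand(tp, fp, fn):
    """Add 1 to every cell estimate, but only if some estimate is below 1."""
    if min(tp, fp, fn) < 1:
        return tp + 1, fp + 1, fn + 1
    return tp, fp, fn


print(smooth_on_demand(30.0, 0.0, 12.5))  # (31.0, 1.0, 13.5): smoothing applied
print(smooth_on_demand(30.0, 7.5, 12.5))  # (30.0, 7.5, 12.5): left untouched
```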


4.4. Turning confidence scores into probabilities 

We derive the probabilities $P(\omega_k)$ in Equation 3 by assuming that the confidence
scores $C_{ij}$ generated by $\hat{\Phi}_j$ can be trusted (i.e., that the higher $C_{ij}$,
the higher the probability that $D_{ij}$ is correct), and by applying to $C_{ij}$ a
generalized logistic function $f(z) = e^{\sigma z}/(e^{\sigma z} + 1)$. This results in

$$\begin{aligned}
P(fp_j \mid D_{ij} = +1) &= 1 - \frac{e^{\sigma C_{ij}}}{e^{\sigma C_{ij}} + 1} \\
P(fn_j \mid D_{ij} = -1) &= 1 - \frac{e^{\sigma C_{ij}}}{e^{\sigma C_{ij}} + 1}
\end{aligned} \tag{6}$$

The generalized logistic function (see Figure 1) has the effect of monotonically converting
scores ranging on $(-\infty, +\infty)$ into real values in the [0.0,1.0] range (hence, since
confidence scores are nonnegative, the probabilities of Equation 6 range on [0.0,0.5]). When
$C_{ij} = 0$ (this happens when $\hat{\Phi}_j$ has no confidence at all in its own decision
$D_{ij}$), then

$$P(tp_j \mid D_{ij} = +1) = P(fp_j \mid D_{ij} = +1) = 0.5$$
$$P(fn_j \mid D_{ij} = -1) = P(tn_j \mid D_{ij} = -1) = 0.5$$

i.e., the probability of correct classification and the probability of misclassification are
identical. Conversely, we have

$$\lim_{C_{ij} \to +\infty} P(fp_j \mid D_{ij} = +1) = 0$$
$$\lim_{C_{ij} \to +\infty} P(fn_j \mid D_{ij} = -1) = 0$$

i.e., when $\hat{\Phi}_j$ has a very high confidence in its own decision $D_{ij}$, the
probability that $D_{ij}$ is wrong is taken to be close to 0.

The reason why we use a generalized version of the logistic function instead of its
non-parametric version (which corresponds to the case $\sigma = 1$) is that using the latter
within Equation ?? would give rise to a very high number of zero probabilities
of misclassification, since the non-parametric logistic function converts every positive
number above a certain threshold (≈ 36) to a number that standard implementations
round up to 1 even when working in double precision. By tuning the $\sigma$ parameter (the
growth rate) we can tune the speed at which the right-hand side of the sigmoid asymptotically
approaches 1, and we can thus tune how evenly Equation ?? distributes the
confidence values across the [0.0,0.5] interval.

The process of optimizing $\sigma$ within Equation ?? is usually called probability calibration.
How we actually optimize $\sigma$ is discussed in Section 6.2.
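
The effect of the growth rate can be illustrated with a minimal Python sketch. It assumes non-negative confidence scores and evaluates the formula naively, exactly as written above; the function name `p_misclassified` and the sample values are hypothetical and serve only as an illustration.

```python
import math


def p_misclassified(confidence, sigma):
    """Probability that a decision is wrong: 1 - e^(sigma*C) / (e^(sigma*C) + 1).

    Evaluated naively on purpose, to expose the rounding behaviour
    discussed in the text. Assumes confidence >= 0.
    """
    z = math.exp(sigma * confidence)
    return 1.0 - z / (z + 1.0)


if __name__ == "__main__":
    for c in (0.0, 5.0, 40.0):
        # sigma = 1 is the non-parametric logistic; sigma = 0.5 grows more slowly.
        print(c, p_misclassified(c, 1.0), p_misclassified(c, 0.5))
```

With $\sigma = 1$ a confidence score of 40 already yields a misclassification probability of exactly zero in double precision, while with $\sigma = 0.5$ the same score still yields a small positive value.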


4.5. Ranking by total utility 


Our function $U_j(d_i, \Omega)$ of Section 4.1 is thus obtained by plugging Equations ?? and
?? into Equation ??. Therefore, we are now in a position to compute, given an automatically
classified document $d_i$ and a class $c_j$, the utility, for the aims of increasing
$F_1(\hat\Phi_j(Te))$, of manually validating the label $D_{ij}$ attributed to $d_i$ by $\hat\Phi_j$.

Fig. 1. The generalized logistic function.

Now, let us recall from Section 3 that our goal is addressing not just the binary,
but the multi-class multi-label TC case, in which binary classification must be accomplished
simultaneously for |C| > 2 different classes. It might seem sensible to propose
ranking, for each $c_j \in C$, all the automatically labelled documents in Te in decreasing
order of their $U_j(d_i, \Omega)$ value. Unfortunately, this would generate |C| different rankings,
and in an operational context it seems implausible to ask a human annotator to scan
|C| different rankings of the same document set (this would mean reading the same
document |C| times in order to validate its labels). As suggested in [Esuli and Sebastiani 2009]
for active learning, it seems instead more plausible to generate a single
ranking, according to a score $U(d_i, \Omega)$ that is a function of the |C| different $U_j(d_i, \Omega)$
scores. In such a way, the human annotator will scan this single ranking from the
top, validating all the |C| different labels for $d_i$ before moving on to another document.
As the criterion for generating the overall utility score $U(d_i, \Omega)$ we use total utility,
corresponding to the simple sum



$$U(d_i, \Omega) = \sum_{c_j \in C} U_j(d_i, \Omega) \qquad (9)$$


Our final ranking is thus generated by sorting the test documents in descending order
of their $U(d_i, \Omega)$ score.

From the standpoint of computational cost, this technique is $O(|Te| \cdot (|C| + \log |Te|))$,
since the cost of sorting the test documents by their score is $O(|Te| \log |Te|)$, and
the cost of computing the $U(\cdot, \Omega)$ score for |Te| documents and |C| classes is $O(|Te| \cdot |C|)$.
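
The ranking step can be sketched in a few lines of Python. This is a minimal illustration that assumes the per-class utilities have already been computed; the function name `rank_by_total_utility`, the dictionary layout, and the numbers are hypothetical.

```python
def rank_by_total_utility(utilities):
    """Rank documents by total utility, highest first.

    utilities maps a document id to the list of its |C| class-specific
    utility values (one per class).
    """
    # O(|Te| * |C|): sum the class-specific utilities of each document.
    totals = {doc: sum(per_class) for doc, per_class in utilities.items()}
    # O(|Te| log |Te|): sort the documents by descending total utility.
    return sorted(totals, key=lambda doc: totals[doc], reverse=True)


if __name__ == "__main__":
    example = {
        "d1": [0.10, 0.02, 0.00],
        "d2": [0.30, 0.05, 0.01],
        "d3": [0.00, 0.20, 0.10],
    }
    print(rank_by_total_utility(example))
```

The two commented steps correspond to the two terms of the cost analysis above.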

5. EXPECTED NORMALIZED ERROR REDUCTION 

No measures are known from the literature for evaluating the effectiveness of a SATC-oriented
ranking method $\rho$. We here propose such a measure, which we call expected
normalized error reduction (denoted $ENER_\rho$). In this section we will introduce
$ENER_\rho$ in a stepwise fashion.

5.1. Error reduction at rank n

Let us first introduce the notion of residual error at rank n (noted $E_\rho(n)$), defined as
the error that is still present in the document set Te after the human annotator has
validated the documents at the first n rank positions in the ranking generated by $\rho$.

Fig. 2. Error reduction, measured as $ER^M_\rho(n)$, as a function of validation depth (x axis: inspection depth).
The dataset is Reuters-21578, the learners are MP-Boost (left) and SVMs (right). The Random curve indicates
the results of our estimation of the expected ER of the random ranker via a Monte Carlo method with
100 random trials. Higher curves are better.


The value of $E_\rho(0)$ is the initial error generated by the automated classifier, and the
value of $E_\rho(|Te|)$ is 0. We assume our measure of error to range on [0,1]; if so, $E_\rho(n)$
ranges on [0,1] too. We will hereafter call n the validation depth (or inspection depth).
We next define error reduction at rank n to be


$$ER_\rho(n) = \frac{E_\rho(0) - E_\rho(n)}{E_\rho(0)} \qquad (10)$$


i.e., a value in [0,1] that indicates the error reduction obtained by a human annotator
who has validated the documents at the first n rank positions in the ranking generated
by $\rho$; 0 stands for no reduction, 1 stands for total elimination of error.
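
As an illustration, the error reduction curve can be derived from a sequence of residual errors as in the following Python sketch. The function name `error_reduction` and the sample values are hypothetical; in practice each residual error would be obtained by re-evaluating the chosen error measure after validating the first n documents.

```python
def error_reduction(residual_errors):
    """Turn residual errors E(0), E(1), ..., E(|Te|) into error reductions.

    residual_errors[n] is the error left after validating the first n
    documents of the ranking; the result holds ER(n) at the same index.
    """
    initial = residual_errors[0]
    if initial == 0:
        raise ValueError("the initial error is 0, so there is nothing to reduce")
    return [(initial - err) / initial for err in residual_errors]


if __name__ == "__main__":
    # Hypothetical residual errors for a test set of three documents.
    print(error_reduction([0.5, 0.25, 0.125, 0.0]))
```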

Example plots of the $ER_\rho(n)$ measure are displayed in Figure 2, where different
curves represent different ranking methods $\rho'$, $\rho''$, ... and where, for convenience,
the x axis indicates the fraction n/|Te| of the test set that has been validated
rather than the number n of validated documents. By definition all curves start at the
origin of the axes (i.e., if the annotator validates 0 test documents, no error reduction is
obtained) and end at the upper right corner of the graph (i.e., if the annotator validates
all the |Te| test documents, a complete elimination of error is obtained). More convex
(i.e., higher) curves represent better strategies, since they indicate that a higher error
reduction is achieved for the same amount of manual validation effort.

The reason why we focus on error reduction, instead of the complementary concept
of “increase in accuracy”, is that error reduction always has the same upper bound
(i.e., 100% reduction), independently of the initial error. In contrast, the increase in
accuracy that derives from validating the documents does not always have the same
upper bound. For instance, if we assume that accuracy values range on [0,1], an increase
in accuracy of 100% is indeed possible when the initial accuracy is 0.5, while this
increase is not possible when the initial accuracy is 0.9. This makes the notion of “increase
in accuracy” less immediately interpretable, since different datasets and/or different
classifiers give rise to different initial levels of accuracy. So, using “error reduction”
instead of “increase in accuracy” makes our curves more immediately interpretable,
since error reduction has the same range (i.e., [0,1]) irrespective of the dataset
and/or the initial classifier used.

Since (as stated in Section 3) we use $F_1$ for measuring effectiveness, as a measure of
classification error we use $E_1 = (1 - F_1)$, which indeed (as assumed at the beginning of
this section) ranges on [0,1]. In order to measure the overall effectiveness of a ranking
method across the entire set C of classes, we compute macro-averaged $E_1$ (noted $E_1^M$),
obtained by computing the class-specific $E_1$ values and averaging them across the $c_j$'s;
from this it derives that $E_1^M = 1 - F_1^M$. By $ER^M_\rho(n)$ we will indicate macro-averaged
$ER_\rho(n)$, also obtained by computing the class-specific $ER_\rho(n)$ values and averaging
them across the $c_j$'s.


5.2. Normalized error reduction at rank ... 

One problem with $ER_\rho(n)$, though, is that the expected $ER_\rho(n)$ value of the random
ranker is fairly high (see Footnote 4), since it amounts to $\frac{n}{|Te|}$. The difference between the $ER_\rho(n)$
value of a genuinely engineered ranking method $\rho$ and the expected $ER_\rho(n)$ value of
the random ranker is particularly small for high values of n, and is null for n = |Te|.
This means that it makes sense to factor out the random factor from $ER_\rho(n)$. This
leads us to define the normalized error reduction of ranking method $\rho$ as $NER_\rho(n) =
ER_\rho(n) - \frac{n}{|Te|}$, with macro-averaged $NER_\rho(n)$ obtained as usual and denoted, as usual,
by $NER^M_\rho(n)$.
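
The normalization amounts to subtracting the diagonal of the error reduction plot. A minimal Python sketch follows; the function name `normalized_error_reduction` and the sample curve are hypothetical, and the sketch assumes that the expected error reduction of the random ranker at rank n equals n/|Te|, as stated above.

```python
def normalized_error_reduction(er_curve):
    """Subtract the random-ranker value n/|Te| from each ER(n).

    er_curve[n] holds ER(n) for n = 0, ..., |Te|, so the test set size
    is len(er_curve) - 1.
    """
    size = len(er_curve) - 1
    return [er - n / size for n, er in enumerate(er_curve)]


if __name__ == "__main__":
    # Hypothetical ER curve for a test set of three documents.
    print(normalized_error_reduction([0.0, 0.5, 0.75, 1.0]))
```

Both endpoints of the resulting curve are zero by construction, which reflects the fact that no ranking method can beat the random ranker when nothing, or everything, is validated.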


5.3. ... and its expected value 

However, $NER_\rho(n)$ is still unsatisfactory as a measure, since it depends on a specific
value of n (which is undesirable, since our human annotator may decide to work
down the ranked list as far as she deems suitable). Following [Robertson 2008] we assume
that the human annotator stops validating the ranked list at exactly rank n with
probability $P_s(n)$ (the index s stands for “stoppage”). We can then define the expected
normalized error reduction of ranking method $\rho$ on a given document set Te as the
expected value of $NER_\rho(n)$ according to probability distribution $P_s(n)$, i.e.,


$$ENER_\rho = \sum_{n=1}^{|Te|} P_s(n) \, NER_\rho(n) \qquad (11)$$


with macro-averaged $ENER_\rho$ indicated, as usual, as $ENER^M_\rho$.

Different probability distributions $P_s(n)$ can be assumed. In order to base the definition
of such a distribution on a plausible model of user behaviour, we here make
the assumption (along with [Moffat and Zobel 2008]) that a human annotator, after
validating a document, goes on to validate the next document with probability (or persistence
[Moffat and Zobel 2008]) $p$ or stops validating with probability $(1 - p)$, so that


$$P_s(n) = \begin{cases} p^{n-1}(1 - p) & \text{if } n \in \{1, \ldots, |Te| - 1\} \\ p^{n-1} & \text{if } n = |Te| \end{cases} \qquad (12)$$


It can be shown that, for a sufficiently large value of |Te|, $\sum_{n=1}^{|Te|} n \cdot P_s(n)$ (the expected
number of documents that the human annotator will validate as a function of $p$) asymptotically
tends to $\frac{1}{1-p}$. The value $\xi = \frac{1}{|Te|(1-p)}$ thus denotes the expected fraction of the
test set that the human annotator will validate as a function of $p$.

4 That the expected $ER_\rho(n)$ value of the random ranker is $\frac{n}{|Te|}$ is something that we have not tried to
formally prove. However, that this holds is supported by intuition and is unequivocally shown by Monte
Carlo experiments we have run on our datasets; see Figures 2 to 4 for a graphical representation.

Using this distribution in practice entails the need to determine a realistic value
for $p$. A value $p = 0$ corresponds to a situation in which the human annotator only
validates the top-ranked document, while $p = 1$ indicates a human annotator who validates
each document in the ranked list. Unlike in ad hoc search, we think that in a
SATC context it would be unrealistic to take a value for $p$ as given irrespective of the
size of Te. In fact, given a desired level of error reduction, when |Te| is large the human
annotators need to be more persistent (i.e., characterized by higher $p$) than when
|Te| is small. Therefore, instead of assuming a predetermined value of $p$ we assume
a predetermined value of $\xi$, and derive the value of $p$ from the equation $\xi = \frac{1}{|Te|(1-p)}$.
For example, in a certain application we might assume $\xi = .20$ (i.e., assume that the
average human annotator validates 20% of the test set). In this case, if |Te| = 1000,
then $p = 1 - \frac{1}{.20 \cdot 1000} = .9950$, while if |Te| = 10,000, then $p = 1 - \frac{1}{.20 \cdot 10000} = .9995$. In
the experiments of Section 6 we will test all values of $p$ corresponding to values of $\xi$ in
{.05, .10, .20}.
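
The stopping model and the resulting expectation can be sketched in Python as follows. This is a minimal illustration under the definitions above, not the code used in the experiments; all function names are hypothetical, and the persistence is derived by solving the equation relating the expected validated fraction to the persistence.

```python
def persistence_from_fraction(xi, test_size):
    """Solve xi = 1 / (|Te| * (1 - p)) for the persistence p.

    Only meaningful when xi * test_size >= 1, i.e. when the annotator is
    expected to validate at least one document.
    """
    return 1.0 - 1.0 / (xi * test_size)


def stop_probabilities(p, test_size):
    """Probability of stopping at exactly rank n, for n = 1, ..., |Te|."""
    probs = [p ** (n - 1) * (1.0 - p) for n in range(1, test_size)]
    # An annotator who reaches the last document necessarily stops there.
    probs.append(p ** (test_size - 1))
    return probs


def expected_ner(ner_curve, p):
    """Expected normalized error reduction.

    ner_curve[n - 1] holds the normalized error reduction at rank n,
    for n = 1, ..., |Te|.
    """
    probs = stop_probabilities(p, len(ner_curve))
    return sum(ps * ner for ps, ner in zip(probs, ner_curve))


if __name__ == "__main__":
    print(persistence_from_fraction(0.20, 1000))
    print(persistence_from_fraction(0.20, 10000))
```

The two printed values correspond to the worked example above (test sets of 1000 and 10,000 documents with 20% of the test set validated on average).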

Note that the values of $ENER_\rho$ are bounded above by 1, but a value of 1 is not attainable.
In fact, even the “perfect ranker” (i.e., the ranking method that top-ranks
all misclassified documents, noted Perf) cannot attain an $ENER_\rho$ value of 1, since in
order to achieve total error elimination all the misclassified documents need to be validated
anyway, one by one, which means that the only condition in which $ENER_{Perf}$
might equal 1 is when there is just 1 misclassified document. We do not try to normalize
$ENER_\rho$ by the value of $ENER_{Perf}$ since $ENER_{Perf}$ cannot be characterized
analytically, and depends on the actual labels in the test set.


6. EXPERIMENTS 

We have now fully specified (Section 4) a method for performing SATC-oriented ranking
and (Section 5) a measure for evaluating the quality of the produced rankings, so
we are now in a position to test the effectiveness of our proposed method. In Sections
6.1 to 6.5 we will describe our experimental setting, while in Section 6.6 we will report
and discuss the actual results of these experiments.


6.1. Experimental protocol 

Let $\Omega$ be a dataset partitioned into a training set Tr and a test set Te. In each experiment
reported in this paper we adopt the following experimental protocol:


(1) For each $c_j \in C$

    (a) Train classifier $\hat\Phi_j$ on Tr and classify Te by means of $\hat\Phi_j$;

    (b) Run k-fold cross-validation on Tr, thereby

        i. computing $TP_j^{Tr}$, $FP_j^{Tr}$, and $FN_j^{Tr}$;

        ii. optimizing the $\sigma$ parameter of Equation ?? (see Section 6.2 below for the
            actual optimization method used);

(2) For every ranking policy $\rho$ tested

    (a) Rank Te according to $\rho$;

    (b) Scan the ranked list from the top, correcting possible misclassifications and
        computing the resulting values of $ENER^M_\rho(\xi)$ for different values of $\xi$.


For Step 1b we have used k = 10; we think this value guarantees a good tradeoff
between the accuracy of the parameter estimates (which tends to increase with k) and
the cost of computing these estimates (which also increases with k).

6.2. Probability calibration

We optimize the $\sigma$ parameter by picking the value of $\sigma$ that minimizes the average
(across the $c_j \in C$) absolute value of the difference between $Pos_j^{Tr}$, the number of positive
training examples of class $c_j$, and $E[Pos_j^{Tr}]$, the expected number of such examples
as resulting from the probabilities of membership in $c_j$ computed in the k-fold
cross-validation. That is, we pool together all the training documents classified in the k-fold
cross-validation phase, and then we pick


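$$\begin{aligned}
& \arg\min_{\sigma} \frac{1}{|C|} \sum_{c_j \in C} \left| Pos_j^{Tr} - E[Pos_j^{Tr}] \right| = \\
& \arg\min_{\sigma} \frac{1}{|C|} \sum_{c_j \in C} \left| Pos_j^{Tr} - \sum_{d_i \in Tr} P(c_j \mid d_i) \right| = \\
& \arg\min_{\sigma} \frac{1}{|C|} \sum_{c_j \in C} \left| Pos_j^{Tr} - \sum_{d_i \in Tr} \frac{e^{\sigma \hat\Phi_j(d_i)}}{e^{\sigma \hat\Phi_j(d_i)} + 1} \right|
\end{aligned} \qquad (13)$$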


This method is a much faster calibration method than the traditional method of picking
the value of $\sigma$ that has performed best in k-fold cross-validation (see Footnote 5). In fact, unlike the
latter, it does not depend on the ranking method $\rho$. Therefore, this method spares us
from the need of ranking the training set several times, i.e., once for each combination
of a tested value of $\sigma$ and a ranking method $\rho$.
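
The calibration criterion can be sketched in Python as follows. This is an illustration only: the text does not say how the space of growth rates is searched, so the explicit list of candidate values is an assumption, as are all function names and the layout of the inputs.

```python
import math


def logistic(score, sigma):
    """Generalized logistic, written as 1 / (1 + e^(-sigma * score)).

    Algebraically equal to e^(sigma * score) / (e^(sigma * score) + 1).
    """
    return 1.0 / (1.0 + math.exp(-sigma * score))


def calibration_error(sigma, scores, positives):
    """Average over classes of |Pos_j - expected number of positives|.

    scores[j] lists the cross-validation scores of all training documents
    for class j; positives[j] is the true number of positive examples.
    """
    total = 0.0
    for class_scores, pos in zip(scores, positives):
        expected = sum(logistic(s, sigma) for s in class_scores)
        total += abs(pos - expected)
    return total / len(positives)


def calibrate(scores, positives, candidates):
    """Pick the candidate growth rate with the lowest calibration error.

    The candidate grid is an assumption of this sketch.
    """
    return min(candidates, key=lambda s: calibration_error(s, scores, positives))


if __name__ == "__main__":
    # Hypothetical scores for two classes and three training documents.
    scores = [[2.0, -2.0, 1.0], [-1.0, -3.0, 0.5]]
    positives = [2, 1]
    print(calibrate(scores, positives, [0.1, 1.0, 10.0]))
```

Note that no ranking is ever computed here, which is the reason why this criterion is cheap to evaluate.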


6.3. Learning algorithms 

As our first learning algorithm for generating our classifiers $\hat\Phi_j$ we use a boosting-based
learner called MP-Boost [Esuli et al. 2006]. Boosting-based methods have
shown very good performance across many learning tasks and, at the same time, have
strong justifications from computational learning theory. MP-Boost is a variant of
AdaBoost.MH [Schapire and Singer 2000] optimized for multi-label settings, which
has been shown in [Esuli et al. 2006] to obtain considerable effectiveness improvements
with respect to AdaBoost.MH. In all our experiments we set the S parameter
of MP-Boost (representing the number of boosting iterations) to 1000.

As the second learning algorithm we use support vector machines (SVMs). We use
the implementation from the freely available LibSvm library (see Footnote 6), with a linear kernel
and parameters at their default values.

In all the experiments discussed in this paper stop words have been removed, punctuation
has been removed, all letters have been converted to lowercase, numbers have
been removed, and stemming has been performed by means of Porter’s stemmer. Word
stems are thus our indexing units. Since MP-Boost requires binary input, only their
presence/absence in the document is recorded, and no weighting is performed. Documents
are instead weighted (by standard cosine-normalized tfidf) for the SVMs experiments.


6.4. Datasets 

Our first dataset is the Reuters-21578 corpus. It consists of a set of 12,902 news 
stories, partitioned (according to the standard “ModApte” split we have adopted) into 
a training set of 9603 documents and a test set of 3299 documents. The documents 


5 This method is sometimes called Platt calibration (see e.g., [Niculescu-Mizil and Caruana 2005]), due to its
use in [Platt 2000]. However, the method was in use well before Platt’s article (see e.g., [Ittner et al. 1995,
Section 2.3]).

6 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Table I. Characteristics of the test collections used. From left to right we report the number of
test sets |T| (Column 2) and, for each test set, the number of training documents |Tr| (3), the
number of test documents |Te| (4), the number of classes |C| (5), and the average number of
classes per test document ACD (6). Columns 7-10 report the initial error (both $E_1^M$ and $E_1^\mu$)
generated by the MP-Boost and SVMs classifiers.

                                                              E_1^M             E_1^μ
Dataset              |T|     |Tr|     |Te|    |C|    ACD      MP-B    SVMs      MP-B    SVMs
Reuters-21578          1     9603     3299    115    1.135    .392    .473      .152    .140
Reuters-21578/10      10     9603      330    115    1.135    .194    .199      .151    .130
Reuters-21578/100    100     9603       33    115    1.135    .050    .049      .149    .140
OHSUMED                1   183229    50216     97    0.132    .553    .577      .389    .324
OHSUMED-S              1    12358     3584     97    1.851    .520    .522      .286    .244


are labelled by 118 categories; the average number of categories per document is 1.08,
ranging from a minimum of 0 to a maximum of 16; the number of positive examples
per class ranges from a minimum of 1 to a maximum of 3964. In our experiments we
have restricted our attention to the 115 categories with at least one positive training
example. This dataset is publicly available (see Footnote 7) and is probably the most widely used
benchmark in text classification research; this fact allows other researchers to easily
replicate the results of our experiments.

Another dataset we have used is OHSUMED [Hersh et al. 1994], a test collection
consisting of a set of 348,566 MEDLINE references spanning the years from 1987 to
1991. Each entry consists of summary information relative to a paper published in one
of 270 medical journals. The available fields are title, abstract, MeSH indexing terms,
author, source, and publication type. Not all the entries contain abstract and MeSH
indexing terms. In our experiments we have scrupulously followed the experimental
setup presented in [Lewis et al. 1996]. In particular, (i) we have used for our experiments
only the 233,445 entries with both abstract and MeSH indexing terms; (ii) we
have used the entries relative to years 1987 to 1990 (183,229 documents) as the training
set and those relative to year 1991 (50,216 documents) as the test set; (iii) as the
categories on which to perform our experiments we have used the main heading MeSH
index terms assigned to the entries. Concerning this latter point, we have restricted
our experiments to the 97 MeSH index terms that belong to the Heart Disease (HD)
subtree of the MeSH tree, and that have at least one positive training example. This
is the only point in which we deviate from [Lewis et al. 1996], which experiments only
on the 77 most frequent MeSH index terms of the HD subtree.

The main characteristics of our datasets, and of three variants (called Reuters-21578/10,
Reuters-21578/100, and OHSUMED-S) that will be discussed in Section
6.6, are conveniently summarized in Table I.


6.5. Lower bounds and upper bounds 

As the baseline for our experiments we use the confidence-based strategy discussed in
Section 1, which corresponds to using our utility-theoretic method with both $G(fp)$ and
$G(fn)$ set to 1. As discussed in Footnote ??, while this strategy has not (to the best of
our knowledge) explicitly been proposed before, it seems a reasonable, common-sense
strategy anyway.


7 http://www.daviddlewis.com/resources/testcollections/reuters21578/

While the confidence-based method will act as our lower bound, we have also run
“oracle-based” methods aimed at identifying upper bounds for the effectiveness of our
utility-theoretic method, i.e., at assessing the effectiveness of “idealized” (albeit non-realistic)
systems at our task.

The first such method (dubbed Oracle1) works by “peeking” at the actual values of
$TP_j$, $FP_j$, $FN_j$ in Te, using them in the computation of $G(d_i, fp_j)$ and $G(d_i, fn_j)$, and
applying our utility-theoretic method as usual. Oracle1 thus indicates how our method
would behave were it able to “perfectly” estimate $TP_j$, $FP_j$, and $FN_j$. The difference in
effectiveness between Oracle1 and our method will thus be due to (i) the performance
of the method adopted for smoothing contingency tables, and (ii) possible differences
between the distribution of the documents across the contingency table cells in the
training and in the test set.

In the second such method (Oracle2) we instead peek at the true labels of the documents
in Te, which means that we will be able to (a) use the actual values of $TP_j$, $FP_j$,
$FN_j$ in the computation of $G(d_i, fp_j)$ and $G(d_i, fn_j)$ (as in Oracle1), and (b) replace the
probabilities in Equation ?? with the true binary values (i.e., replacing $P(x)$ with 1 if
$x$ is true and 0 if $x$ is false), after which we apply our utility-based ranking method as
usual. The difference in effectiveness between Oracle2 and our method will be due to
factors (i) and (ii) already mentioned for Oracle1 and to our method’s (obvious) inability
to perfectly predict whether a document was classified correctly or not.


6.6. Results and discussion 


The results of our experiments are given in Table II, where we present the results
of running, for each of two learners (MP-Boost and SVMs) and five datasets
(Reuters-21578, OHSUMED, and three variants of them, called Reuters-21578/10,
Reuters-21578/100, and OHSUMED-S, that we will introduce in Sections
6.6.2, 6.6.3, and 6.6.4), our utility-theoretic method against the three methods discussed
in Section 6.5. In Table II our method, Oracle1, and Oracle2 are actually indicated as
U-Theoretic(s), Oracle1(s), and Oracle2(s), to distinguish them from variants (indicated
as U-Theoretic(d), Oracle1(d), and Oracle2(d)) that will be described in Section 7. Table
II presents $ENER^M_\rho(\xi)$ values for three representative values of $\xi$, i.e., 0.05, 0.10, and
0.20.


For each of two learners and five datasets, and for each pairwise combination of all
the methods discussed (including those we will discuss in Section 7), we have run a
paired t-test with $ENER^M_\rho(0.10)$ as the evaluation measure and 0.05 as the significance
level, in order to determine whether the difference in performance between the
two methods is statistically significant. The results of such tests are reported in Table
III.
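
For readers unfamiliar with the test, the paired t statistic can be computed as in the following Python sketch. The text does not specify what the paired observations are; the sketch simply assumes two equally long lists of matched scores (e.g., one value per class for each of the two methods), which is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom.

    scores_a[i] and scores_b[i] must be matched observations of the two
    methods being compared. Requires at least two pairs and differences
    that are not all identical (otherwise the standard deviation is 0).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical matched scores of two methods.
    t, df = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0])
    print(t, df)
```

The difference is declared significant at the 0.05 level when the absolute value of the statistic exceeds the two-tailed critical value of Student's t distribution for the given degrees of freedom.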


6.6.1. Mid-sized test sets. Figure 2 plots the results, in terms of $ER^M_\rho(n)$, of our experiments
with the MP-Boost and SVM learners on the Reuters-21578 dataset. The
results of these experiments in terms of $ENER^M_\rho$ as a function of the chosen value of
$\xi$ are instead reported in Table II. The optimal value of $\sigma$ returned by the k-fold
cross-validation phase is .554 for MP-Boost and 7.096 for SVMs; these values, sharply different
from 1 and from each other, clearly show the advantage of converting confidence
scores into probabilities via a generalized logistic function.

The first insight we can draw from these results is that our U-Theoretic(s) method
outperforms Baseline in a very substantial way (the paired t-test, see Table III,
indicates that this difference is statistically significant). This can be appreciated both
from the plots of Figure 2, in which the red curve (corresponding to U-Theoretic(s))
is markedly higher than the green curve (corresponding to Baseline), and from Table

Table II. Results of various ranking methods, applied to two learning algorithms and several test collections, in
terms of $ENER^M_\rho(\xi)$, for $\xi \in \{0.05, 0.10, 0.20\}$. Improvements listed for the various methods are relative to the
baseline.

                  MP-Boost                                     SVMs
Method            ξ = 0.05       ξ = 0.10       ξ = 0.20       ξ = 0.05       ξ = 0.10       ξ = 0.20

Reuters-21578
Baseline          .071           .108           .152           .262           .352           .420
U-Theoretic(s)    .163 (+128%)   .226 (+109%)   .280 (+84%)    .442 (+69%)    .531 (+51%)    .562 (+34%)
U-Theoretic(d)    .160 (+124%)   .224 (+107%)   .279 (+84%)    .431 (+65%)    .523 (+49%)    .557 (+33%)
Oracle1(s)        .155 (+117%)   .222 (+106%)   .280 (+84%)    .477 (+82%)    .563 (+60%)    .586 (+40%)
Oracle1(d)        .152 (+113%)   .219 (+103%)   .275 (+81%)    .476 (+82%)    .567 (+61%)    .592 (+41%)
Oracle2(s)        .693 (+869%)   .738 (+583%)   .707 (+365%)   .719 (+174%)   .760 (+116%)   .723 (+72%)
Oracle2(d)        .677 (+847%)   .725 (+571%)   .699 (+360%)   .723 (+176%)   .763 (+117%)   .724 (+72%)

Reuters-21578/10
Baseline          .063           .097           .135           .243           .322           .383
U-Theoretic(s)    .145 (+131%)   .203 (+110%)   .245 (+81%)    .330 (+36%)    .415 (+29%)    .465 (+21%)
U-Theoretic(d)    .139 (+121%)   .198 (+105%)   .239 (+77%)    .335 (+38%)    .420 (+30%)    .470 (+23%)
Oracle1(s)        .159 (+153%)   .205 (+112%)   .243 (+80%)    .392 (+61%)    .482 (+50%)    .522 (+36%)
Oracle1(d)        .158 (+152%)   .212 (+119%)   .255 (+89%)    .394 (+62%)    .488 (+52%)    .531 (+39%)
Oracle2(s)        .555 (+784%)   .643 (+566%)   .648 (+380%)   .596 (+145%)   .676 (+110%)   .672 (+75%)
Oracle2(d)        .558 (+789%)   .648 (+571%)   .654 (+384%)   .599 (+147%)   .679 (+111%)   .675 (+76%)

Reuters-21578/100
Baseline          .069           .121           .164           .226           .302           .364
U-Theoretic(s)    .118 (+71%)    .172 (+42%)    .215 (+31%)    .291 (+29%)    .365 (+21%)    .416 (+14%)
U-Theoretic(d)    .119 (+72%)    .176 (+45%)    .217 (+32%)    .289 (+28%)    .367 (+22%)    .419 (+15%)
Oracle1(s)        .192 (+178%)   .247 (+104%)   .281 (+71%)    .318 (+41%)    .422 (+40%)    .479 (+32%)
Oracle1(d)        .197 (+185%)   .266 (+120%)   .318 (+94%)    .318 (+41%)    .427 (+41%)    .489 (+34%)
Oracle2(s)        .429 (+521%)   .537 (+344%)   .575 (+251%)   .458 (+103%)   .568 (+88%)    .600 (+65%)
Oracle2(d)        .429 (+521%)   .537 (+344%)   .576 (+251%)   .458 (+103%)   .569 (+88%)    .601 (+65%)

OHSUMED
Baseline          .385           .479           .512           .526           .630           .644
U-Theoretic(s)    .442 (+15%)    .529 (+10%)    .549 (+7%)     .623 (+18%)    .685 (+9%)     .666 (+3%)
U-Theoretic(d)    .443 (+15%)    .531 (+11%)    .550 (+7%)     .618 (+17%)    .676 (+7%)     .655 (+2%)
Oracle1(s)        .445 (+16%)    .530 (+11%)    .549 (+7%)     .639 (+21%)    .687 (+9%)     .657 (+2%)
Oracle1(d)        .449 (+17%)    .532 (+11%)    .550 (+7%)     .617 (+17%)    .659 (+5%)     .636 (-1%)
Oracle2(s)        .838 (+118%)   .839 (+75%)    .769 (+50%)    .864 (+64%)    .854 (+36%)    .778 (+21%)
Oracle2(d)        .758 (+97%)    .762 (+59%)    .700 (+37%)    .795 (+51%)    .787 (+25%)    .721 (+12%)

OHSUMED-S
Baseline          .021           .025           .026           .075           .124           .164
U-Theoretic(s)    .087 (+323%)   .118 (+374%)   .132 (+402%)   .212 (+184%)   .282 (+127%)   .323 (+97%)
U-Theoretic(d)    .088 (+329%)   .118 (+374%)   .132 (+402%)   .210 (+182%)   .280 (+126%)   .321 (+96%)
Oracle1(s)        .091 (+343%)   .117 (+370%)   .125 (+375%)   .272 (+265%)   .334 (+169%)   .352 (+115%)
Oracle1(d)        .094 (+358%)   .119 (+378%)   .128 (+387%)   .301 (+303%)   .363 (+193%)   .380 (+132%)
Oracle2(s)        .481 (+2246%)  .554 (+2125%)  .572 (+2075%)  .511 (+585%)   .589 (+375%)   .603 (+268%)
Oracle2(d)        .450 (+2095%)  .498 (+1900%)  .496 (+1786%)  .487 (+553%)   .540 (+335%)   .536 (+227%)

II. In the latter, for $\xi = .10$ (corresponding to $p = .996$) our method obtains relative
improvements over Baseline of +109% (MP-Boost) and +51% (SVMs); for $\xi = .20$ the
improvements, while not as high as for $\xi = .10$, are still sizeable (+84% for MP-Boost
and +34% for SVMs), while for $\xi = .05$ the improvements are even higher than for
$\xi = .10$ (+128% for MP-Boost and +69% for SVMs).

Table III. Statistical significance results obtained for the two learners (MP-Boost and SVMs) via a
paired t-test with $ENER^M_\rho(0.10)$ as the evaluation measure and 0.05 as the significance level. “Y”
means that there is a statistically significant difference between the two methods, while “N” means
there is not; each 5-tuple of Y’s and N’s indicates this for the five datasets studied in this paper
(Reuters-21578, Reuters-21578/10, Reuters-21578/100, OHSUMED, OHSUMED-S, in
this order).

MP-Boost
                  Baseline        U-Theoretic(s)  U-Theoretic(d)  Oracle1(s)      Oracle1(d)      Oracle2(s)      Oracle2(d)
Baseline          -               YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           YYYYY
U-Theoretic(s)    YYYYY           -               NYNNN           NNYNN           YNYNN           YYYYY           YYYYY
U-Theoretic(d)    YYYYY           NYNNN           -               NNYNN           NNYNN           YYYYY           YYYYY
Oracle1(s)        YYYYY           NNYNN           NNYNN           -               NNYNN           YYYYY           YYYYY
Oracle1(d)        YYYYY           YNYNN           NNYNN           NNYNN           -               YYYYY           YYYYY
Oracle2(s)        YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           -               NNNYN
Oracle2(d)        YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           NNNYN           -

SVMs
                  Baseline        U-Theoretic(s)  U-Theoretic(d)  Oracle1(s)      Oracle1(d)      Oracle2(s)      Oracle2(d)
Baseline          -               YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           YYYYY
U-Theoretic(s)    YYYYY           -               YNNYN           YYYNY           YYYNY           YYYYY           YYYYY
U-Theoretic(d)    YYYYY           YNNYN           -               YYYNY           YYYNY           YYYYY           YYYYY
Oracle1(s)        YYYYY           YYYNY           YYYNY           -               NNYNY           YYYYY           YYYYY
Oracle1(d)        YYYYY           YYYNY           YYYNY           NNYNY           -               YYYYY           YYYYY
Oracle2(s)        YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           -               YYNNN
Oracle2(d)        YYYYY           YYYYY           YYYYY           YYYYY           YYYYY           YYNNN           -


A second insight is that, surprisingly, our method hardly differs in terms of performance
from Oracle1(s). The two curves can be barely distinguished in Figure 2, and
in terms of $ENER^M_\rho$ Oracle1(s) is even slightly outperformed, in the MP-Boost experiments,
by U-Theoretic(s) (e.g., .226 vs. .222 for $\xi = .10$); the paired t-test (see Table
III) indicates that the difference between the two methods is not statistically significant.
This shows that (at least judging from these experiments) Laplace smoothing is
nearly optimal, and there is likely not much we can gain from applying alternative,
more sophisticated smoothing methods. This is sharply different from what happens
in language modelling, where Laplace smoothing has been shown to be an underperformer
[Gale and Church 1994]. The fact that with MP-Boost our method slightly
(and strangely) outperforms Oracle1(s) is probably due to accidental, “serendipitous”
interactions between the probability estimation component (Equation ??) and the contingency
cell estimation component of Section 4.3; in fact, the paired t-test indicates
(see Table III) that this difference is not statistically significant.

A third interesting fact is that error reduction is markedly better in the SVM experiments
than in the MP-Boost experiments. This is evident from the fact that
the Figure 2 curves for SVMs are much more convex (i.e., are higher) and are closer
to the optimum (i.e., closer to the Oracle2(s) curve) than the corresponding Figure 2
curves for MP-Boost. This fact is also evident from the numerical results reported
in Table II where, with U-Theoretic(s), SVMs obtain $ENER^M_\rho(.10) = .531$, which is
+134% better than the $ENER^M_\rho(.10) = .226$ result obtained by MP-Boost (similar
improvements can be observed for the other methods and for the other values of $\xi$).
This provides a striking contrast with the classification accuracy results reported in
Table I where, on the same dataset, MP-Boost ($E_1^M = .392$) substantially outperformed
SVMs ($E_1^M = .473$). It is easy to conjecture that, even if MP-Boost yields
higher classification accuracy, it generates less reliable (calibrated) confidence scores,
i.e., it generates confidence scores that correlate with the ground truth worse than the
SVM-generated scores.

The rates of improvement of U-Theoretic(s) over the baseline are instead much higher for
MP-BOOST than for SVMs (e.g., for ξ = .10 these are +109% and +51%, respectively). (The same
goes for the improvements of Oracle1(s) over the baseline.) This is likely due to the fact
that, as observed above, the absolute values of $ENER^{M}_{\rho}(\xi)$ obtained by the
baseline are much higher for SVMs than for MP-BOOST for all methods, so the margins of
improvement with respect to the baseline are smaller for SVMs than for MP-BOOST.

6.6.2. Small test sets. We have also run a batch of experiments aimed at assessing how
the methods fare when ranking test sets much smaller than REUTERS-21578. This
may be more challenging than ranking larger sets since, when the test set is small,
Laplace smoothing (i) can seriously perturb the relative proportions among the cell
counts, which can generate poor estimates of $G(d_i, fp_j)$ and $G(d_i, fn_j)$, and (ii) is
performed for more classes, since (as discussed at the end of Section 4.3) we smooth “on
demand” only, and since the likelihood that $\hat{TP}^{ML}_j$, $\hat{FP}^{ML}_j$,
$\hat{FN}^{ML}_j$ are smaller than 1 is higher with small test sets. This is also a realistic
setting since, if a set of unlabelled documents is small, it is likely that validating a
portion of it that can lead to sizeable enough effectiveness improvements is feasible from an
economic point of view.

Rather than choosing a completely different dataset, we generate 10 new test sets by 
randomly splitting the REUTERS-21578 test set in 10 equally-sized parts (about 330 
documents each). In our experiments we run each ranking method on each such part 
individually and average the results across the 10 parts. We call this experimental 
scenario Reuters-21578/10. This allows us to study the effects of test set size on our 
methods in a more controlled way than if we had picked a completely different dataset, 
since test set size is the only difference with respect to the previous REUTERS-21578 
experiments. 

The results displayed in Figure 3 allow us to visually appreciate that U-Theoretic(s)
substantially outperforms Baseline also in this context. This can be seen also from
Table II: for ξ = .10 the relative improvement over Baseline is +110% for MP-BOOST
and +30% for SVMs, and similarly substantial improvements are obtained for the two
other values of ξ tested.

Incidentally, note that the REUTERS-21578/10 experiments model an application 
scenario in which a set of automatically labelled documents is split (e.g., to achieve 
faster throughput) among 10 human annotators, each one entrusted with validating 
a part of the set. In this case, each annotator is presented with a ranking of her own
document subset, and works exclusively on it (see footnote 8).
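
As an illustration only (it is not part of the experiments reported here), the “round robin” allotment mentioned in footnote 8 can be sketched in Python as follows. The function name `round_robin_split`, the 0-based numbering of ranks and annotators, and the document identifiers are assumptions made for this sketch.

```python
def round_robin_split(global_ranking, k):
    """Allot the document at 0-based rank r to annotator (r mod k)."""
    shares = [[] for _ in range(k)]
    for r, doc in enumerate(global_ranking):
        shares[r % k].append(doc)
    return shares


# Hypothetical global ranking, most promising document first.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5", "d3"]
shares = round_robin_split(ranking, 3)
for i, share in enumerate(shares):
    print(i, share)
```

Each annotator receives a sub-ranking that preserves the global order, so an annotator who stops early has still validated the most promising documents of her share.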

6.6.3. Tiny test sets. In further experiments that we have run, we have split the 
REUTERS-21578 test set even further, i.e., into 100 equally-sized parts of about 33 
documents each, so as to test the performance of Laplace smoothing methods in even 


8 Actually, if we did have k annotators available, the best strategy would be to generate the k rankings in a
“round robin” fashion, i.e., by allotting to annotator i the documents ranked (in the global ranking) at the
positions r such that (r mod k) = i. This splitting method would guarantee that only the most promising
documents are validated by the annotators.

[Figure 3: two plots (MP-BOOST left, SVMs right); horizontal axis: Inspection Depth]

Fig. 3. Results obtained by (a) splitting the REUTERS-21578 test set into 10 random, equally-sized parts,
(b) running the analogous experiments of Figure 2 independently on each part, and (c) averaging the results
across the 10 parts. The learners used are MP-BOOST (left) and SVMs (right).


more challenging conditions. We call this experimental scenario REUTERS-21578/100.
From an application point of view this is a less interesting scenario than the two
previously discussed ones, since applying a ranking method to a set of only 33 documents
is of debatable utility, given that a human annotator confronted with the task of
validating just 33 documents can arguably check them all without any need for ranking.
The goal of these experiments is thus to check whether our method can perform well
even in extreme, albeit scarcely realistic, conditions.

The detailed $ER^{M}_{\rho}(n)$ plots for this REUTERS-21578/100 scenario are presented in
Figure 4, while the $ENER^{M}_{\rho}$ results are reported in Table II. U-Theoretic(s) still
outperforms Baseline, with a relative improvement of +42% with MP-BOOST and +21%
with SVMs with ξ = .10, corresponding to ρ = .696; qualitatively similar improvements
are obtained with the other tested values of ξ.

Note that in these experiments, unlike in those performed on the full REUTERS-21578,
the Oracle1(s) method proves to be markedly superior to U-Theoretic(s) (e.g.,
.247 vs. .172 in terms of $ENER^{M}_{\rho}(.10)$ with MP-BOOST, and similarly for other
values of ξ and for the SVM learner); unlike in the previous two datasets, the difference
between the two methods turns out to be statistically significant. The reason is that, for
a smaller test set, (a) distribution drift is higher, (b) “smoothing on demand” is invoked
more frequently (because the likelihood that contingency table cells have a value < 1
is higher), and (c) when smoothing is indeed applied the distribution across the cells
of the contingency table is perturbed more strongly.

Note also that the $ER^{M}_{\rho}(n)$ curves are smoother than the analogous curves for the
full REUTERS-21578 and, although to a lesser extent, those for REUTERS-21578/10.
This is due to the fact that the curves in Figure 4 result from averages across 100
different experiments, and the increase brought about at rank n is actually the average
of the increases brought about at rank n in the 100 experiments.


9 From the next experiments onwards, for reasons of space we will not include the full plots in the style of
Figures 2 to 4, and will only report $ENER^{M}_{\rho}$ results.

[Figure 4: two plots (MP-BOOST left, SVMs right); horizontal axis: Inspection Depth]

Fig. 4. Same as Figure 3 but with REUTERS-21578/100 in place of REUTERS-21578/10. The learners used
are MP-BOOST (left) and SVMs (right).


6.6.4. Large test sets. While in the previous sections we have discussed experiments on
mid-sized to small (or very small) datasets, we now look at larger datasets such as
OHSUMED. The OHSUMED results in Table II confirm the quality of U-Theoretic(s),
which outperforms the purely confidence-based baseline by +10% (MP-BOOST) and
+9% (SVMs) in terms of $ENER^{M}_{\rho}(.10)$; qualitatively similar improvements are
obtained for the other two values of ξ studied.

The OHSUMED collection is characterized by the presence of an unusually large
number (93.1% of the entire lot) of unlabelled documents (i.e., documents that are
negative examples for all $c_j \in C$) that originally belonged to other subtrees of the
MeSH tree. Since such a large percentage is unnatural, we have generated (and also
used in our experiments) a variant of OHSUMED (called OHSUMED-S) by removing
all the unlabelled documents from both the training set and the test set.

As illustrated in Table II, on OHSUMED-S U-Theoretic(s) outperforms the
confidence-based baseline by a very large margin (+374% with MP-BOOST and +127%
with SVMs for ξ = .10, with qualitatively similar results for the other two tested
values of ξ).


6.6.5. Discussion. In sum, the results discussed from Section 6.6.1 to the present one
have unequivocally shown that U-Theoretic(s) outperforms the confidence-based
baseline, usually by a large or very large margin, for all the five tested datasets and for
both tested learners.

Note that, for all five datasets and for both learners, the improvements of the
utility-theoretic methods over Baseline are larger for smaller values of ξ. This indicates that
the difference between the two methods is larger for smaller validation depths, i.e.,
where using the utility-theoretic method pays off the most is at the very top of the
ranking. This is an important feature of this method, since it means that all human
annotators, be they persistent or not (i.e., independently of the depth at which they
validate), are going to benefit from this approach.

7. AN IMPROVED, “DYNAMIC” RANKING FUNCTION FOR SATC

The utility-theoretic method discussed in Section 4 is reasonable but, in principle,
suboptimal, and its suboptimality derives from its “static” nature. To see this, assume that
the system has ranked the test documents according to the strategy above, that the human
annotator has started from the top of the list and validated the labels of document
$d_i$, that she has found out that its label assignment for class $c_j$ is a false negative,
and that she has corrected it, thus bringing about an increase in $F_1$ equivalent to

$$\frac{2(TP_j + 1)}{2(TP_j + 1) + FP_j + (FN_j - 1)} - \frac{2TP_j}{2TP_j + FP_j + FN_j}$$

Following this correction, the value of $FN_j$ is decreased by 1 and the value of $TP_j$
is increased by 1. This means that, when another false negative for $c_j$ is found and
corrected, the value of (??) has changed. In other words, the improvement in $F_1$ due
to the validation of a false negative is not constant through the validation process. Of
course, similar considerations apply for false positives.
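
The non-constancy of this improvement can be checked numerically. The following Python sketch is an illustration under stated assumptions: the helper names `f1` and `successive_fn_gains` and the cell counts (10 true positives, 30 false positives, 20 false negatives) are chosen for this sketch, and $F_1$ is taken to be 1 when all three cells are 0.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; taken to be 1 when all three cells are 0."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def successive_fn_gains(tp, fp, fn):
    """Increase in F1 brought about by each false negative, corrected one at a time."""
    gains = []
    while fn > 0:
        before = f1(tp, fp, fn)
        tp, fn = tp + 1, fn - 1          # the corrected document becomes a true positive
        gains.append(f1(tp, fp, fn) - before)
    return gains


gains = successive_fn_gains(tp=10, fp=30, fn=20)
print([round(g, 4) for g in gains[:3]])  # [0.0241, 0.0235, 0.0228]
```

With these counts each successive correction is worth slightly less than the previous one, which is the effect a static ranking cannot take into account.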

This suggests redefining the validation gains defined in Equations ?? and ?? as

$$G(d_i, fp_j) = \frac{2TP_j}{2TP_j + (FP_j - 1) + FN_j} - \frac{2TP_j}{2TP_j + FP_j + FN_j}$$

$$G(d_i, fn_j) = \frac{2(TP_j + 1)}{2(TP_j + 1) + FP_j + (FN_j - 1)} - \frac{2TP_j}{2TP_j + FP_j + FN_j}$$

To see the novelty introduced with respect to Equation ??, in the following we will
discuss the case of false negatives; the case of false positives is completely analogous. The
difference between Equation ?? and Equation ?? is that the former equates $G(d_i, fn_j)$
with the increase in $F_1(\hat{\Phi}_j(Te))$ that would derive by correcting all of the
documents in $FN_j$ divided by their number, while the latter equates $G(d_i, fn_j)$ with the
increase in $F_1(\hat{\Phi}_j(Te))$ that would derive by correcting the next document in
$FN_j$. In other words, we might say that Equation ?? enforces the notion of average gain,
while Equation ?? enforces the notion of pointwise gain (see footnote 10). The two versions
return different values of $G(d_i, fn_j)$: as the following example shows, it is immediate to
verify that if $FN_j$ contains more than one document, the validation gains $G(d_i, fn_j)$
that derive by correcting different documents are the same (by definition) if we use
Equation ?? but are not the same if we use Equation ??.

10 Equations ?? might have also been formulated in a continuous way, i.e., as partial derivatives of $F_1$ in
the two variables $TP_j$ and $TN_j$ (in other words, Equations ?? would thus represent the gradient of $F_1$). We
have preferred to stick to a discrete formulation, since (a) Equations ?? and ?? are instead not naturally
formulated as derivatives (exactly because they represent average, rather than pointwise, gains), and
since (b) having Equations ??, ?? and ?? all formulated in a common notation allows an easier comparison
among them.

Example 7.1. Suppose we have classified a set of 100 documents according to class
$c_j$, and that the classification is such that $TP_j = 10$, $FN_j = 20$, $FP_j = 30$, and
$TN_j = 40$. According to Equation ??, $G(d_i, fn_j)$ evaluates to ≈ 0.0190 for each false
negative corrected. Instead, according to Equation ??, $G(d_i, fn_j)$ evaluates to ≈ 0.0241
for the 1st false negative corrected, ≈ 0.0235 for the 2nd, ≈ 0.0228 for the 3rd, ..., down to
≈ 0.0150 for the 20th. □

Given this new definition we may implement a dynamic strategy in which, instead of
plainly sorting the test documents in descending order of their $U(d_i, \Omega)$ score, after
each correction is made we update $\hat{TP}^{La}_j$, $\hat{FP}^{La}_j$, $\hat{FN}^{La}_j$ by
adding and subtracting 1 where appropriate, we recompute $G(d_i, fp_j)$, $G(d_i, fn_j)$ and
$U(d_i, \Omega)$, and we use the newly computed $U(d_i, \Omega)$ values when selecting the
document that should be presented next to the human annotator. In detail, the following steps
are iteratively performed:

(1) For all classes $c_j \in C$, compute $G(d_i, fp_j)$ and/or $G(d_i, fn_j)$ using Equations ??;

(2) If the human annotator does not want to stop validating documents, then identify
    the document $d_{max} = \arg\max_{d_i \in Te} U(d_i, \Omega)$ for which total utility is
    maximised;

(3) Remove $d_{max}$ from $Te$;

(4) For all $c_j \in C$, have the human annotator check the label attached by $\hat{\Phi}_j$
    to $d_{max}$; if all these labels are correct go to Step 2; else, for all classes
    $c_j \in C$ for which the label attached by $\hat{\Phi}_j$ to $d_{max}$ is incorrect:

    (a) Have the human annotator correct the label;

    (b) If $d_{max}$ was a false positive for $c_j$, decrease $\hat{FP}^{La}_j$ by 1; if it was
        a false negative for $c_j$, increase $\hat{TP}^{La}_j$ by 1 and decrease
        $\hat{FN}^{La}_j$ by 1;

    (c) Re-smooth $\hat{TP}^{La}_j$, $\hat{FP}^{La}_j$, $\hat{FN}^{La}_j$ if needed;

    (d) Recompute $G(d_i, fp_j)$ and/or $G(d_i, fn_j)$ and go back to Step 2.
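
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the names `dynamic_validation`, `pointwise_gains`, `p_error` and `budget` are hypothetical, the annotator is simulated by a table of true labels, the re-smoothing of Step 4c is omitted (counts are simply clamped at 0), and a single probability of misclassification per document and class stands in for the calibrated probabilities.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; taken to be 1 when all three cells are 0."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def pointwise_gains(tp, fp, fn):
    """Gain in F1 from correcting the next false positive / false negative."""
    base = f1(tp, fp, fn)
    g_fp = f1(tp, fp - 1, fn) - base if fp >= 1 else 0.0
    g_fn = f1(tp + 1, fp, fn - 1) - base if fn >= 1 else 0.0
    return g_fp, g_fn


def dynamic_validation(predicted, p_error, counts, truth, budget):
    """Return the order in which documents are shown to the (simulated) annotator."""
    counts = list(counts)                                # estimated (TP, FP, FN) per class
    remaining = list(predicted)
    gains = [pointwise_gains(*c) for c in counts]        # Step 1

    def utility(d):
        # A predicted positive can only be a false positive, a predicted
        # negative only a false negative; correct labels bring no gain.
        return sum(
            p_error[d][j] * (gains[j][0] if predicted[d][j] else gains[j][1])
            for j in range(len(counts))
        )

    order = []
    while remaining and len(order) < budget:             # the annotator stops at `budget`
        d_max = max(remaining, key=utility)              # Step 2: linear scan, no sorting
        remaining.remove(d_max)                          # Step 3
        order.append(d_max)
        for j, (tp, fp, fn) in enumerate(counts):        # Step 4
            if predicted[d_max][j] == truth[d_max][j]:
                continue
            if predicted[d_max][j]:                      # Step 4b: false positive
                counts[j] = (tp, max(fp - 1, 0), fn)
            else:                                        # Step 4b: false negative
                counts[j] = (tp + 1, fp, max(fn - 1, 0))
            gains[j] = pointwise_gains(*counts[j])       # Step 4d
    return order


# Toy data: three documents, one class.
predicted = {"a": [True], "b": [False], "c": [False]}
p_error = {"a": [0.5], "b": [0.4], "c": [0.1]}
truth = {"a": [False], "b": [True], "c": [False]}
order = dynamic_validation(predicted, p_error, [(1.0, 1.0, 1.0)], truth, budget=3)
print(order)  # ['b', 'a', 'c']
```

In this toy run document b is shown first even though a has the higher probability of error, because with these counts correcting a false negative is worth more than correcting a false positive.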


This might also be dubbed an incremental ranking strategy, in the sense pioneered in
[Aalbersberg 1992] for relevance feedback in ad-hoc search, since the values
of $G(d_i, fp_j)$ and $G(d_i, fn_j)$ are incrementally updated so that the $U(d_i, \Omega)$
function reflects the fact that part of $Te$ has indeed been corrected. In keeping with
[Brandt et al. 2011] we prefer to call it a dynamic strategy, and to call the one of
Section 4 a static one.

Note that in Step 2 we simply compute the maximum element (according to $U(d_i, \Omega)$)
of $Te$ instead of sorting the entire set, since we can perform this step in $O(|Te|)$ instead
of $O(|Te| \log |Te|)$ (see footnote 11). Furthermore, note that in this algorithm the
recomputation of $U_j(d_i, \Omega)$ does not entail the recomputation of the probabilities
$P(fp_j)$ and/or $P(fn_j)$ of Equation ??, since these probabilities are computed (i.e.,
calibrated) once for all, immediately after the training phase.

Note also that computing validation gains via Equations ?? and ?? is the only
possibility within the static method (since the values of $G(d_i, fp_j)$ and $G(d_i, fn_j)$
produced must be used unchanged throughout the process), but is clearly inadequate in a
dynamic context, in which validation gains are always supposed to be up-to-date
reflections of the current situation.

The dynamic nature of this method makes it clear why, as specified at the end of
Section 4.3, we smooth the cell count estimates only “on demand” (see also Step 4c of the
above algorithm), i.e., only if any of $\hat{TP}^{ML}_j$, $\hat{FP}^{ML}_j$,
$\hat{FN}^{ML}_j$ is < 1. To see this, suppose that we smooth $\hat{TP}^{ML}_j$,
$\hat{FP}^{ML}_j$, $\hat{FN}^{ML}_j$ at each iteration, even when not strictly needed.
Adding a count of one to each of them at each iteration means that, after k iterations,
k counts have been added to each of them; this means that, after many iterations, the
counts added to the cells have completely disrupted the relative proportions among
the cells that result from the maximum-likelihood estimation. This would likely make
the dynamic method underperform the static method, which does not suffer from this
problem since the maximum-likelihood estimates are smoothed only once. As a result,


11 When computing this maximum element returns repeatedly a document whose labels are all correct, the
lack of a sorting step entails the need of computing the maximum element several times in a row with the
values of $G(d_i, fp_j)$ and $G(d_i, fn_j)$ unchanged. In these cases, the presence of a sorting step would thus
have been advantageous. However, the likelihood that this situation occurs tends to be small, especially
when |C| is large, thus making the computation of the maximum element preferable to sorting.

we smooth a contingency table only when strictly needed, i.e., when one of
$\hat{TP}^{ML}_j$, $\hat{FP}^{ML}_j$, $\hat{FN}^{ML}_j$ is < 1.

By solving the inequality $G(d_i, fn_j) > G(d_i, fp_j)$ we may find out under which
conditions correcting a false negative yields a higher gain than correcting a false
positive. It turns out that, when validation gains are defined according to Equation ??,
$G(d_i, fn_j) > G(d_i, fp_j)$ whenever $FN + FP > 1$, i.e., practically always. Of course, this
need not be the case for evaluation functions different from $F_1$, and in particular for
instances of $F_\beta$ with $\beta \neq 1$.
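
For completeness, the inequality can be worked out as follows. The derivation assumes the pointwise definition of the gains (i.e., $F_1$ after one correction minus $F_1$ before it), and introduces the shorthand $D = 2TP_j + FP_j + FN_j$, with $D > 1$ so that both denominators are positive; the shorthand $D$ is ours.

```latex
% Assumption: pointwise gains, with D = 2TP_j + FP_j + FN_j and D > 1.
% The common term F_1 = 2TP_j / D cancels on both sides of the inequality.
\begin{align*}
G(d_i, fn_j) > G(d_i, fp_j)
  &\iff \frac{2(TP_j + 1)}{D + 1} > \frac{2TP_j}{D - 1} \\
  &\iff (TP_j + 1)(D - 1) > TP_j (D + 1) \\
  &\iff D - 1 > 2TP_j \\
  &\iff FP_j + FN_j > 1 .
\end{align*}
```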

From the standpoint of total computational cost, our dynamic technique is
$O(|Te| \cdot (|C| + |Te|))$, since (i) computing the $U(d_i, \Omega)$ score for $|Te|$
documents and computing their maximum according to the computed $U(d_i, \Omega)$ score can be
done in $O(|Te| \cdot |C|)$ steps, and (ii) this step must be repeated $O(|Te|)$ times. This
policy is thus, as expected, computationally more expensive than the previous one.


7.1. Experiments 

The results of the experiments with the dynamic version of our utility-theoretic method
and of our two oracle-based methods are reported in Figures 2 to 4 and in Table II,
where they are indicated as U-Theoretic(d), Oracle1(d) and Oracle2(d). Of course there
exists no dynamic version of the baseline method, since this latter does not involve
validation gains.

The first observation that can be drawn from these results is the fact that
U-Theoretic(d) is not superior to U-Theoretic(s), as could instead have been expected. In
fact, in Figures 2 to 4 the curves corresponding to the former are barely distinguishable
from those corresponding to the latter, and the numeric results reported in Table
II show no substantial difference either; as reported in Table III, in 7 out of 10 cases
(2 learners x 5 datasets) the difference is not statistically significant. Note that there
are extremely small differences also between Oracle1(s) and Oracle1(d); again, in 7 out
of 10 cases no statistically significant difference can be detected. This shows that the
lack of any substantial difference between static and dynamic is not due to a possible
suboptimality of the method for estimating contingency table cells (including the
method adopted for smoothing the estimates). Analogously, note also the extremely
small differences between Oracle2(s) and Oracle2(d) (again, no statistically significant
difference in 7 out of 10 cases), which indicates that the culprit is not the method for
estimating the probabilities of misclassification.

This substantial equivalence between the static and the dynamic methods is somewhat
surprising, since on a purely intuitive basis the dynamic method seems definitely
superior to the static one. We think that the reason for this apparently counterintuitive
result is that, when validation gains are recomputed in Step 4d of the algorithm,
the magnitude of the update (i.e., the difference between validation gains before and
after the update) is too small to make an impact. This is especially true for large test
sets, where incrementing or decrementing by 1 the value of a contingency cell makes
too tiny a difference, since that value is very large.

Actually, the part of Figure 2 relative to MP-BOOST displays an apparently strange
phenomenon, i.e., the fact that for some values of ξ the Oracle2(s) method outperforms
Oracle2(d). A similar phenomenon can be noticed in some of the cells of Table II, where
the static version of either Oracle1 or Oracle2 outperforms, even if by a small margin,
the dynamic version. This seems especially strange for Oracle2(d), which is the
theoretically optimal method (since it is a method that operates with perfect foreknowledge),
and as such should be impossible to beat. The reason for this apparently counterintuitive
behaviour lies not in the ranking methods, but in a counterintuitive property
of $F_1$, i.e., the fact that, when $TP = FN = 0$ (i.e., there are no positives in the gold


ACM Transactions on Knowledge Discovery from Data, Vol. V, No. N, Article A, Publication date: January YYYY. 



G. Berardi, A. Esuli, and F. Sebastiani 


Table IV. Comparison between the actual computation times (in seconds) of the U-Theoretic(s)
and U-Theoretic(d) methods on our five datasets.

Dataset            Method          MP-Boost    SVMs
Reuters-21578      U-Theoretic(s)  0.426       0.452
                   U-Theoretic(d)  3.128       3.021
Reuters-21578/10   U-Theoretic(s)  0.166       0.153
                   U-Theoretic(d)  0.195       0.198
Reuters-21578/100  U-Theoretic(s)  0.033       0.033
                   U-Theoretic(d)  0.046       0.044
OHSUMED            U-Theoretic(s)  10.282      11.251
                   U-Theoretic(d)  500.047     577.864
OHSUMED-S          U-Theoretic(s)  0.418       0.424
                   U-Theoretic(d)  4.731       4.195


standard, and 25 out of 115 classes in the dataset used in Figure 2 have this property),
its value is 0 when $FP > 0$ but 1 when $FP = 0$ (so, $TP = FP = FN = 0$ is a
“point of discontinuity” for $F_1$). This essentially means that, when $TP = FN = 0$ and
$FP > 0$, $G(d_i, fp_j)$ is $1/FP$ for the static method and 0 for the dynamic method; i.e.,
in this case the dynamic method does not provide any incentive for correcting a false
positive, while the static method does. As a result, the static method can speed up the
correction of false positives more than the dynamic method does. As mentioned above,
this phenomenon exposes a suboptimality not of the dynamic method, but of the $F_1$
function.
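
The discontinuity just described can be reproduced with a few lines of Python. The sketch below is an illustration only: the helper names are hypothetical, and the static gain is written as the average gain obtained by correcting all false positives of a class, the dynamic gain as the gain obtained by correcting the next one.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; 1 when TP = FP = FN = 0, as discussed above."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def static_fp_gain(tp, fp, fn):
    """Average gain: F1 with all false positives corrected, divided by their number."""
    return (f1(tp, 0, fn) - f1(tp, fp, fn)) / fp


def dynamic_fp_gain(tp, fp, fn):
    """Pointwise gain: F1 after correcting only the next false positive."""
    return f1(tp, fp - 1, fn) - f1(tp, fp, fn)


# A class with no positives in the gold standard and five false positives.
print(static_fp_gain(0, 5, 0))   # 0.2
print(dynamic_fp_gain(0, 5, 0))  # 0.0
# Only the last remaining false positive makes F1 jump from 0 to 1.
print(dynamic_fp_gain(0, 1, 0))  # 1.0
```

In the pointwise view the whole benefit is thus concentrated in the last false positive of the class, so the earlier ones receive no credit.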

In Table IV we report the actual computation times incurred by both U-Theoretic(s)
and U-Theoretic(d) on our five datasets (see footnote 12). These figures confirm that the
dynamic method is (as already discussed above) substantially more expensive to run than
the static method; in particular, the magnitude of this difference, together with the
marginal (if any) accuracy improvements brought about by the dynamic method over
the static one, shows that the static method is much more cost-effective than the
dynamic one. In other words, the bad news is that the dynamic method brings about no
improvement; the good news is that the computationally cheaper static method is hard
to beat.

8. A “MICRO-ORIENTED” RANKING FUNCTION FOR SATC 

In Section 3 we have assumed that the evaluation of classification algorithms across
the $|C|$ classes of interest is performed by macro-averaging the $F_1$ results obtained for
the individual classes $c_j \in C$. Consistently with this view, in Section 5 we have
introduced macro-averaged versions of $E_1$, $ER_\rho$, $NER_\rho$, and $ENER_\rho$.
Macro-averaging across the classes in $C$ essentially means paying equal attention to all of
them, irrespective of their frequency or other such characteristics.


12 The times reported are relative to an experiment in which the entire test set is validated; this is because,
in a simulated experiment, the entire test set must be validated in order to compute the $ER^{M}_{\rho}(n)$ values
reported in Figures 2 to 4. In a realistic setting in which only a portion of the ranked list is validated,
the difference between U-Theoretic(s) and U-Theoretic(d) is smaller, since the cost of recomputing validation
gains is roughly proportional to the validation depth, and since this cost affects U-Theoretic(d) but not
U-Theoretic(s).

However, there is an alternative, equally important way to evaluate effectiveness
when a set of $|C|$ classes is involved, namely, micro-averaged effectiveness. While
macro-averaged measures are computed by first computing the measure of interest
individually on each class-specific contingency table and then averaging the results,
micro-averaged measures are computed by merging the $|C|$ contingency tables into a
single one (via summing the values of the corresponding cells) and then computing
the measure of interest on the resulting table. For instance, micro-averaged $F_1$ (noted
$F^{\mu}_1$) is obtained by (i) computing the category-specific values $TP_j$, $FP_j$ and
$FN_j$ for all $c_j \in C$, (ii) obtaining $TP$ as the sum of the $TP_j$'s (same for $FP$ and
$FN$), and then (iii) applying Equation ??. Measures such as $E^{\mu}_1$, $ER^{\mu}_\rho$,
$NER^{\mu}_\rho$ and $ENER^{\mu}_\rho$ are defined in the obvious way. The net effect of using
a single, global contingency table is that micro-averaged measures pay more attention to more
frequent classes, i.e., the more the members of a class $c_j$ in the test set, the more the
measure is influenced by $c_j$.
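
The difference between the two ways of averaging can be illustrated with a short Python sketch. The two contingency tables below are hypothetical (one frequent, well-classified class and one rare, poorly classified class), and the function names are chosen for this sketch.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; taken to be 1 when all three cells are 0."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def macro_f1(tables):
    """Compute F1 on each class-specific table, then average the results."""
    return sum(f1(*t) for t in tables) / len(tables)


def micro_f1(tables):
    """Sum the tables cell by cell, then compute F1 on the global table."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return f1(tp, fp, fn)


# (TP, FP, FN) for a frequent class and for a rare class.
tables = [(90, 10, 10), (1, 3, 3)]
print(macro_f1(tables))  # close to 0.575: both classes weigh the same
print(micro_f1(tables))  # 0.875: the frequent class dominates
```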

Neither macro- nor micro-averaging is the “right” way to average in evaluating
multi-label multi-class classification; it is instead the case that in some applications
we may want to pay equal attention to all the classes (in which case macro-averaging
would be our evaluation method of choice), while in some other applications we may
want to pay more attention to the most frequent classes (in which case we should opt
for micro-averaging).

While we have not explicitly discussed this, the method of Section 4 was devised with
macro-averaged effectiveness in mind. To see this, note that the $U(d_i, \Omega)$ function of
Equation ?? is based on an unweighted sum of the class-specific $U_j(d_i, \Omega)$ scores,
i.e., it pays equal importance to all classes in $C$. This means that Equation ?? is optimized
for metrics that also pay equal attention to all classes, as all macro-averaged measures do.
We now describe a way to modify the method of Section 4 in such a way that it is instead
optimal when our effectiveness measure of choice (e.g., $ENER_\rho$) is micro-averaged. To
do this, we do away with Equation ?? and (similarly to what happens for $F^{\mu}_1$ and
$E^{\mu}_1$) compute instead $U(d_i, \Omega)$ directly on a single, global contingency table
obtained by the cell-wise sum of the class-specific contingency tables. That is, we redefine
$U(d_i, \Omega)$ as

$$U(d_i, \Omega) = \sum_{c_j \in C} \; \sum_{\omega_k \in \{tp_j, fp_j, fn_j, tn_j\}} P(\omega_k)\, G(d_i, \omega_k) \qquad (16)$$

where

$$G(d_i, fp_j) = \frac{1}{FP}\left(F_1^{FP}(\hat{\Phi}(Te)) - F_1(\hat{\Phi}(Te))\right)
             = \frac{1}{FP}\left(\frac{2TP}{2TP + FN} - \frac{2TP}{2TP + FP + FN}\right)$$

$$G(d_i, fn_j) = \frac{1}{FN}\left(F_1^{FN}(\hat{\Phi}(Te)) - F_1(\hat{\Phi}(Te))\right)
             = \frac{1}{FN}\left(\frac{2(TP + FN)}{2(TP + FN) + FP} - \frac{2TP}{2TP + FP + FN}\right)$$

Equations ?? are the same as Equation ?? and ??, but for the fact that the latter are
class-specific (as indicated by the index j) while the former are global. This is due to
the fact that, when using micro-averaging, there is a single contingency table, and the
gain obtained by correcting, say, a false positive for $c_x$ is equal to the gain obtained by
correcting a false positive for $c_y$, for any $c_x, c_y \in C$. Of course, Equations ?? are to
be applied when the static method of Section 4 needs to be optimized for micro-averaging;
when we instead want to do the same optimization for the dynamic method of Section
7, we need instead to apply, in the obvious way, “global” versions of Equations ??.

Actually, a second aspect in the method of Section 4 that we need to change in order
for it to be optimized for micro-averaging is the probability calibration method
discussed in Section 6.2. In fact, Equation ?? is clearly devised with macro-averaging
in mind, since it minimizes the average across the $c_j \in C$ of the difference between
the number $Pos^{Tr}_j$ and the expected number $E[Pos^{Tr}_j]$ of positive training examples
of class $c_j$. Again, all classes are given equal attention. For our micro-averaging-oriented
method we thus replace Equation ?? with

$$\begin{aligned}
\arg\min_{\sigma} \left| Pos^{Tr} - E[Pos^{Tr}] \right|
 &= \arg\min_{\sigma} \Big| \sum_{c_j \in C} Pos^{Tr}_j - \sum_{c_j \in C} E[Pos^{Tr}_j] \Big| \\
 &= \arg\min_{\sigma} \Big| \sum_{c_j \in C} Pos^{Tr}_j - \sum_{c_j \in C} \sum_{d_i \in Tr} P(c_j | d_i) \Big| \\
 &= \arg\min_{\sigma} \Big| \sum_{c_j \in C} Pos^{Tr}_j - \sum_{c_j \in C} \sum_{d_i \in Tr}
    \frac{e^{\sigma \hat{\Phi}_j(d_i)}}{e^{\sigma \hat{\Phi}_j(d_i)} + 1} \Big|
\end{aligned} \qquad (18)$$


where the difference between the number and the expected number of training
examples in the global contingency table is minimized. It is easy to verify that the two
methods may return different values of $\sigma$, as the following example shows.

Example 8.1. Suppose that $C = \{c_1, c_2\}$, that $Pos^{Tr}_1 = 20$ and that
$Pos^{Tr}_2 = 10$. Suppose that when $\sigma = a$ then $E[Pos^{Tr}_1] = 18$ and
$E[Pos^{Tr}_2] = 8$, while when $\sigma = b$ then $E[Pos^{Tr}_1] = 17$ and
$E[Pos^{Tr}_2] = 13$. According to Equation ?? value a is better than b (since
$\frac{1}{|C|} \sum_{c_j \in C} |Pos^{Tr}_j - E[Pos^{Tr}_j]|$ is equal to 2 for $\sigma = a$ and
to 3 for $\sigma = b$), but according to Equation ?? value b is better than a (since
$|Pos^{Tr} - E[Pos^{Tr}]|$ is equal to 4 for $\sigma = a$ and to 0 for $\sigma = b$). □
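
The arithmetic of Example 8.1 can be checked with the following Python sketch. The function names are chosen for illustration; the two objectives are, respectively, the class-averaged absolute difference (macro-oriented calibration) and the absolute difference computed on the class totals (micro-oriented calibration).

```python
def macro_objective(pos, expected):
    """Average over the classes of |Pos_j - E[Pos_j]|."""
    return sum(abs(p - e) for p, e in zip(pos, expected)) / len(pos)


def micro_objective(pos, expected):
    """|Pos - E[Pos]| on the totals, i.e., on the global contingency table."""
    return abs(sum(pos) - sum(expected))


pos = [20, 10]                                  # positive training examples of c_1, c_2
expected = {"a": [18, 8], "b": [17, 13]}        # expected numbers for the two parameter values

for value, exp in expected.items():
    print(value, macro_objective(pos, exp), micro_objective(pos, exp))
# a 2.0 4
# b 3.0 0
```

The macro-oriented objective prefers value a while the micro-oriented one prefers value b, as stated in the example.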


The same smoothing methods as discussed in Section 4.3 can instead be used; however, 
note that smoothing is likely to be needed much less frequently (if at all) here since, 
given that we now have a single global contingency table, it is much less likely that 
any of its cells have values < 1. 


8.1. Experiments 

The experiments with our “micro-oriented” methods are reported in Table V. Note that,
since the method we use as baseline corresponds (as noted in Section 6.5) to using
U-Theoretic(s) with all validation gains set to 1, the baseline we use here is different
from the baseline we had used in Section [6A| since the latter was optimized for
macro-averaging while the one we use here is optimized for micro-averaging. This guarantees
that, in both cases, our baselines are strong ones.

The results show that utility-theoretic methods bring about a much smaller
improvement with respect to the baseline, compared to what we have seen for the
macro-oriented methods. For instance, for the SVM learner, REUTERS-21578 dataset,
and validation depth ξ = .10, the improvement of our (static) micro-oriented
utility-theoretic method with respect to the baseline is just +2%, while the improvement
was +51% for the equivalent macro-oriented method. Across the two ranking methods
(static and dynamic), five datasets, two learners, and three values of inspection
depth studied, improvements range from -1% (i.e., in a few peculiar cases we even
have a small deterioration) to +14%, much smaller than in the macro-oriented case in
which the improvements ranged between +2% and +402%.


Utility-Theoretic Ranking for Semi-Automated Text Classification 


Table V. As Table II, but with $ENER^{\mu}_{\rho}(\xi)$ in place of $ENER^{M}_{\rho}(\xi)$.

                                   MP-Boost                                  SVMs
Dataset            Method          ξ = 0.05      ξ = 0.10      ξ = 0.20      ξ = 0.05      ξ = 0.10      ξ = 0.20
Reuters-21578      Baseline        .107          .167          .222          .240          .325          .389
                   U-Theoretic(s)  .107 (+0%)    .168 (+1%)    .224 (+1%)    .246 (+3%)    .332 (+2%)    .395 (+2%)
                   U-Theoretic(d)  .107 (+0%)    .167 (+0%)    .224 (+1%)    .246 (+3%)    .331 (+2%)    .394 (+1%)
                   Oracle1(s)      .107 (+0%)    .168 (+1%)    .224 (+1%)    .246 (+3%)    .332 (+2%)    .395 (+2%)
                   Oracle1(d)      .107 (+0%)    .167 (+0%)    .224 (+1%)    .246 (+3%)    .331 (+2%)    .395 (+2%)
                   Oracle2(s)      .333 (+211%)  .448 (+168%)  .512 (+131%)  .394 (+64%)   .506 (+56%)   .556 (+43%)
                   Oracle2(d)      .333 (+211%)  .448 (+168%)  .512 (+131%)  .394 (+64%)   .506 (+56%)   .556 (+43%)
Reuters-21578/10   Baseline        .110          .169          .222          .232          .317          .380
                   U-Theoretic(s)  .112 (+2%)    .171 (+1%)    .224 (+1%)    .237 (+2%)    .323 (+2%)    .386 (+2%)
                   U-Theoretic(d)  .113 (+3%)    .171 (+1%)    .224 (+1%)    .238 (+3%)    .322 (+2%)    .383 (+1%)
                   Oracle1(s)      .112 (+2%)    .171 (+1%)    .224 (+1%)    .237 (+2%)    .324 (+2%)    .386 (+2%)
                   Oracle1(d)      .113 (+3%)    .171 (+1%)    .224 (+1%)    .238 (+3%)    .324 (+2%)    .386 (+2%)
                   Oracle2(s)      .325 (+195%)  .438 (+159%)  .502 (+126%)  .385 (+66%)   .496 (+56%)   .547 (+44%)
                   Oracle2(d)      .325 (+195%)  .438 (+159%)  .502 (+126%)  .385 (+66%)   .496 (+56%)   .547 (+44%)
Reuters-21578/100  Baseline        .102          .158          .208          .223          .301          .361
                   U-Theoretic(s)  .107 (+5%)    .163 (+3%)    .212 (+2%)    .224 (+0%)    .305 (+1%)    .366 (+1%)
                   U-Theoretic(d)  .106 (+4%)    .162 (+3%)    .211 (+1%)    .226 (+1%)    .304 (+1%)    .363 (+1%)
                   Oracle1(s)      .115 (+13%)   .170 (+8%)    .216 (+4%)    .232 (+4%)    .317 (+5%)    .377 (+4%)
                   Oracle1(d)      .116 (+14%)   .170 (+8%)    .217 (+4%)    .235 (+5%)    .322 (+7%)    .383 (+6%)
                   Oracle2(s)      .318 (+212%)  .429 (+172%)  .492 (+137%)  .367 (+65%)   .481 (+60%)   .534 (+48%)
                   Oracle2(d)      .318 (+212%)  .429 (+172%)  .492 (+137%)  .367 (+65%)   .481 (+60%)   .534 (+48%)
OHSUMED            Baseline        .442          .552          .583          .492          .600          .620
                   U-Theoretic(s)  .440 (+0%)    .549 (-1%)    .580 (-1%)    .496 (+1%)    .602 (+0%)    .621 (+0%)
                   U-Theoretic(d)  .442 (+0%)    .552 (+0%)    .582 (+0%)    .496 (+1%)    .602 (+0%)    .621 (+0%)
                   Oracle1(s)      .439 (-1%)    .549 (-1%)    .580 (-1%)    .497 (+1%)    .602 (+0%)    .621 (+0%)
                   Oracle1(d)      .441 (+0%)    .551 (+0%)    .582 (+0%)    .497 (+1%)    .603 (+1%)    .621 (+0%)
                   Oracle2(s)      .660 (+49%)   .733 (+33%)   .711 (+22%)   .704 (+43%)   .761 (+27%)   .727 (+17%)
                   Oracle2(d)      .660 (+49%)   .733 (+33%)   .711 (+22%)   .704 (+43%)   .761 (+27%)   .727 (+17%)
OHSUMED-S          Baseline        .044          .068          .094          .058          .096          .136
                   U-Theoretic(s)  .044 (+1%)    .069 (+3%)    .096 (+2%)    .063 (+10%)   .102 (+7%)    .143 (+5%)
                   U-Theoretic(d)  .044 (+1%)    .070 (+3%)    .097 (+3%)    .066 (+14%)   .104 (+9%)    .144 (+6%)
                   Oracle1(s)      .044 (+1%)    .069 (+3%)    .096 (+2%)    .064 (+10%)   .103 (+8%)    .143 (+5%)
                   Oracle1(d)      .044 (+1%)    .070 (+3%)    .097 (+3%)    .066 (+15%)   .105 (+10%)   .144 (+6%)
o 

Oracle2(s) 

.149 

(+242%) 

.221 

(+227%) 

.287 

(+205%) 

.175 

(+203%) 

.259 

(+171%) 

.330 

(+143%) 


Oracle2(d) 

.149 

(+242%) 

.221 

(+227%) 

.287 

(+205%) 

.175 

(+203%) 

.259 

(+171%) 

.330 

(+143%) 


The main reason for these much smaller improvements lies in the combined action of
two factors. The first factor is that the validation gains of Equations ?? are
computed on the global contingency table, whose cells contain very large numbers,
$|C|$ times larger than the values in the local contingency tables of the macro-oriented
method. As a result, the validation gains are themselves very small (increasing or
decreasing a very large value by 1 makes little difference), and the difference between
$G(d_i, fp_j)$ and $G(d_i, fn_j)$ is smaller still. This narrows the gap between the
utility-theoretic methods and the baseline. The second factor is that the utility
function of Equation ??, by collapsing all the class-specific utility values for a
document into a single value, tends to dwarf the differences between the documents.
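The first factor can be illustrated numerically. The following is a minimal sketch, assuming that $F_1$ is computed as $2TP/(2TP+FP+FN)$ and that a validation gain is the change in $F_1$ obtained by correcting a single false positive or a single false negative. The cell counts, the value of `n_classes`, and all function names are hypothetical and are not taken from our experiments.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; defined as 1.0 when all cells are 0 (assumption)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def gain_fp(tp, fp, fn):
    """Change in F1 when one false positive is corrected (it becomes a true negative)."""
    return f1(tp, fp - 1, fn) - f1(tp, fp, fn)


def gain_fn(tp, fp, fn):
    """Change in F1 when one false negative is corrected (it becomes a true positive)."""
    return f1(tp + 1, fp, fn - 1) - f1(tp, fp, fn)


# Hypothetical local (single-class) table: (TP, FP, FN).
local_table = (20, 10, 10)

# Hypothetical global table: the same proportions, with cells n_classes times larger.
n_classes = 100
global_table = tuple(n_classes * cell for cell in local_table)

# Difference between the two types of gain, on each table.
local_gap = gain_fn(*local_table) - gain_fp(*local_table)
global_gap = gain_fn(*global_table) - gain_fp(*global_table)

print(f"local : G(fn)={gain_fn(*local_table):.6f}  G(fp)={gain_fp(*local_table):.6f}")
print(f"global: G(fn)={gain_fn(*global_table):.6f}  G(fp)={gain_fp(*global_table):.6f}")
print(f"gap local={local_gap:.6f}  gap global={global_gap:.6f}")
```

With these made-up counts, both gains and the gap between them shrink by about two orders of magnitude when moving from the local to the global table, which leaves a ranking method little signal with which to tell the two types of error apart.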

It should also be noted that, in the micro-oriented method, improvements are small
also because the margins of improvement are small. As evidence, the improvements
brought about by Oracle2(d) (our theoretical upper bound) with respect to the baseline
are smaller than for the macro-oriented method. For instance, for the MP-BOOST
learner, REUTERS-21578 dataset, and validation depth $\xi = .10$, this improvement is
+168%, while it was +571% for the macro-oriented method. So, improving over the
baseline is more difficult for the micro-oriented method than for the macro-oriented
one. The reason why the margins of improvement are smaller is that, when accuracy
is evaluated at the macro level, the infrequent classes play a bigger role than when
evaluating at the micro level. Infrequent classes are such that a large reduction in
error can be achieved even by validating a few documents of the right type (i.e., false
negatives). As a consequence, for the infrequent classes a ranking method that pays
attention to validation gains has the potential to obtain sizeable improvements in
accuracy right from the beginning; and a method that favours the infrequent classes
tends to shine when evaluated at the macro level.
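The different weight of the infrequent classes at the two levels can be seen in a small worked example. This is only a sketch: the two contingency tables below are invented for illustration, macro-averaged $F_1$ is assumed to be the unweighted mean of the per-class $F_1$ values, and micro-averaged $F_1$ is assumed to be $F_1$ computed on the cell-wise sum of the tables.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; defined as 1.0 when all cells are 0 (assumption)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def macro_f1(tables):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1(*cells) for cells in tables.values()) / len(tables)


def micro_f1(tables):
    """F1 computed on the global table obtained by summing the cells class by class."""
    tp = sum(cells[0] for cells in tables.values())
    fp = sum(cells[1] for cells in tables.values())
    fn = sum(cells[2] for cells in tables.values())
    return f1(tp, fp, fn)


# Hypothetical (TP, FP, FN) tables for one frequent and one infrequent class.
before = {"frequent": (900, 100, 100), "rare": (1, 1, 4)}

# The annotator corrects a single false negative of the infrequent class.
after = dict(before, rare=(2, 1, 3))

macro_delta = macro_f1(after) - macro_f1(before)
micro_delta = micro_f1(after) - micro_f1(before)

print(f"macro-F1 improvement: {macro_delta:.4f}")
print(f"micro-F1 improvement: {micro_delta:.4f}")
```

In this toy setting a single correction moves macro-averaged $F_1$ by more than a tenth, while micro-averaged $F_1$ barely changes.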


9. CONCLUSIONS 

We have presented a range of methods, all based on utility theory, for ranking the 
documents labelled by an automatic classifier. The documents are ranked in such a 
way as to maximize the expected reduction in classification error brought about by a 
human annotator who validates a top-ranked subset of the ranked list. We have also 
proposed an evaluation measure for such ranking methods, based on the expectation 
of the (normalized) reduction in error brought about by the human annotator’s validation
activity. This “semi-automated document classification” task is different from “soft
(document-ranking) classification”, since in the latter case it is the documents with the 
highest probability of being members of the class (and not the ones which bring about 
the highest expected utility if validated) that are top-ranked. 
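The contrast between the three ranking criteria can be sketched as follows. This is a simplified illustration under stated assumptions, not a restatement of our equations: the expected utility of validating a document is taken to be the probability that it is misclassified times the (static) validation gain of the corresponding correction, confidence is taken to be the posterior probability of the assigned label, and the documents, probabilities, and cell counts are invented.

```python
def f1(tp, fp, fn):
    """F1 from contingency cells; defined as 1.0 when all cells are 0 (assumption)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def rank_documents(docs, tp, fp, fn):
    """Rank (doc_id, predicted_positive, p_member) triples under three criteria.

    Returns three lists of doc ids: soft-classification order, lowest-confidence
    order (the baseline), and expected-utility order.
    """
    base = f1(tp, fp, fn)
    gain_fp = f1(tp, fp - 1, fn) - base      # gain of correcting a false positive
    gain_fn = f1(tp + 1, fp, fn - 1) - base  # gain of correcting a false negative

    def p_error(doc):
        # Probability that the assigned label is wrong.
        _, predicted_positive, p_member = doc
        return 1 - p_member if predicted_positive else p_member

    def utility(doc):
        # A predicted positive can only be a false positive, and vice versa.
        gain = gain_fp if doc[1] else gain_fn
        return p_error(doc) * gain

    soft = sorted(docs, key=lambda d: d[2], reverse=True)
    baseline = sorted(docs, key=p_error, reverse=True)
    by_utility = sorted(docs, key=utility, reverse=True)
    return [[d[0] for d in ranking] for ranking in (soft, baseline, by_utility)]


# Hypothetical documents: (id, predicted label is positive?, P(member of class)).
docs = [("A", True, 0.55), ("B", False, 0.40), ("C", True, 0.95)]
soft, baseline, by_utility = rank_documents(docs, tp=20, fp=10, fn=10)

print("soft classification:", soft)
print("lowest confidence  :", baseline)
print("expected utility   :", by_utility)
```

Here document B is slightly less likely to be misclassified than document A, but correcting it would remove a false negative, which in this table is worth about twice as much in $F_1$ as removing a false positive, so the utility-based order places it first.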

Experiments carried out on standard datasets and variants thereof show that the
intuition of using utility theory is correct. In particular, of the four methods studied,
we have found that the two methods optimized for micro-averaged effectiveness bring
about only limited improvements, while the two methods optimized for macro-averaged
effectiveness deliver drastically improved performance with respect to the baseline. We have
also found that the two “static” methods, while seemingly inferior to the “dynamic” 
ones on a purely intuitive basis, perform as well as the dynamic ones at a fraction of 
the computational cost. 

It should be remarked that the very fact of using a utility function, i.e., a function in
which different events are characterized by different gains, makes sense here since we
have adopted an evaluation function, such as $F_1$, in which correcting a false positive
or a false negative brings about different benefits to the final effectiveness score. If we
instead adopted standard accuracy (i.e., the percentage of binary classification decisions
that are correct) as the evaluation measure, utility would default to the probability of
misclassification, and our method would coincide with the baseline, since correcting a
false positive or a false negative would bring about the same benefit. The methods we
have presented are justified by the fact that, in text classification and in other
classification contexts in which imbalance is the rule, $F_1$ is the standard evaluation
function, while standard accuracy is a deprecated measure because of its lack of robustness
to class imbalance (see e.g., [Sebastiani 2002, Section 7.1.2] for a discussion of this point).
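This point can be checked on a toy contingency table. The sketch below assumes the usual definitions of accuracy, $(TP+TN)/(TP+FP+FN+TN)$, and of $F_1$, $2TP/(2TP+FP+FN)$; the cell counts are invented and the helper names are ours.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Fraction of binary classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn, tn=0):
    """F1 ignores true negatives; defined as 1.0 when TP = FP = FN = 0 (assumption)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def gains(measure, tp, fp, fn, tn):
    """Gain of correcting one false positive and one false negative under `measure`."""
    base = measure(tp, fp, fn, tn)
    g_fp = measure(tp, fp - 1, fn, tn + 1) - base  # FP becomes TN
    g_fn = measure(tp + 1, fp, fn - 1, tn) - base  # FN becomes TP
    return g_fp, g_fn


# Hypothetical imbalanced table: few positives, many true negatives.
table = dict(tp=10, fp=5, fn=20, tn=965)

acc_fp, acc_fn = gains(accuracy, **table)
f1_fp, f1_fn = gains(f1, **table)

print(f"accuracy: G(fp)={acc_fp:.4f}  G(fn)={acc_fn:.4f}")
print(f"F1      : G(fp)={f1_fp:.4f}  G(fn)={f1_fn:.4f}")
```

Under accuracy the two gains are identical, so ranking by expected gain reduces to ranking by probability of misclassification; under $F_1$ they differ, which is what gives a utility-based ranking something to exploit.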

The methods we have proposed are valid also when a different instantiation of the
$F_\beta$ function (i.e., with $\beta \neq 1$) is used as the evaluation function. This may
be the case, e.g., when classification is to be applied to a recall-oriented task (such as
e-discovery [Oard et al. 2010; Oard and Webber 2013]), in which case values $\beta > 1$
are appropriate. In these cases our utility-theoretic method can be used once the
appropriate instance of $F_\beta$ is plugged, in place of $F_1$, into the equations defining
the validation gains (and into the equations that lead to the definition of
$ENER_{\rho}(\xi)$). The same trivially holds for any other evaluation function, even
different from $F_\beta$ and even multivariate and non-linear, provided it can be computed
from a contingency table. It is easy to foresee that, the higher the difference between
the roles that false positives and false negatives play in the chosen function, the bigger
the improvements brought about by the utility-theoretic methods with respect to the
baseline are going to be. (For instance, it is easy to foresee that these improvements
would be higher for an $F_\beta$ with $\beta > 1$ than for $F_1$.)
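As a rough numerical check of this expectation, the sketch below plugs $F_\beta = (1+\beta^2)TP / ((1+\beta^2)TP + \beta^2 FN + FP)$ into the same single-correction notion of validation gain and compares the distance between the two gains for $\beta = 1$ and $\beta = 2$. The cell counts and the function names are illustrative assumptions.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from contingency cells; defined as 1.0 when all cells are 0 (assumption)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 1.0 if denom == 0 else (1 + b2) * tp / denom


def gain_gap(tp, fp, fn, beta):
    """Difference between the gain of fixing a false negative and a false positive."""
    base = f_beta(tp, fp, fn, beta)
    g_fp = f_beta(tp, fp - 1, fn, beta) - base      # one FP corrected
    g_fn = f_beta(tp + 1, fp, fn - 1, beta) - base  # one FN corrected
    return g_fn - g_fp


# Hypothetical (TP, FP, FN) counts for a single class.
cells = (20, 10, 10)

gap_f1 = gain_gap(*cells, beta=1.0)
gap_f2 = gain_gap(*cells, beta=2.0)

print(f"gap under F1: {gap_f1:.4f}")
print(f"gap under F2: {gap_f2:.4f}")
```

On this table the gap between the two gains is more than twice as large under $F_2$ as under $F_1$, in line with the expectation above; other tables may of course behave differently.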

We also remark that this technique is not limited to text classification, but can be
useful in any classification context in which class imbalance [He and Garcia 2009], or
cost-sensitivity in general [Elkan 2001], suggest using a measure (such as $F_\beta$) that
caters for these characteristics.

Note that, by using our methods, it is also easy to provide the human annotator with
an estimate of how accurate the labels of the test set are as a result of her validation
activity. In fact, if the contingency cell maximum-likelihood estimates $TP^{ML}$,
$FP^{ML}$ and $FN^{ML}$ (see Section 4.3) are updated (adding and subtracting 1 where
appropriate) after each correction by the human annotator, at any point in the validation
activity these are up-to-date estimates of how well the test set is now classified, and
from these estimates $F_1$ (or any other measure) can be computed as usual.
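Such a running estimate could be maintained along the following lines. This is a minimal sketch, assuming the estimates are kept as three (possibly fractional) cell values for a single class and that each correction moves exactly one unit between cells; the class name, method names, and starting values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContingencyEstimate:
    """Running estimates of the contingency cells of one class."""
    tp: float
    fp: float
    fn: float

    def correct_false_positive(self):
        # The annotator turned a positive label into a negative one:
        # one false positive disappears (it becomes a true negative).
        self.fp -= 1

    def correct_false_negative(self):
        # The annotator turned a negative label into a positive one:
        # one false negative becomes a true positive.
        self.fn -= 1
        self.tp += 1

    def f1(self):
        # Defined as 1.0 when all three cells are 0 (assumption).
        denom = 2 * self.tp + self.fp + self.fn
        return 1.0 if denom == 0 else 2 * self.tp / denom


# Hypothetical starting estimates, before any validation has taken place.
estimate = ContingencyEstimate(tp=40.0, fp=12.0, fn=18.0)
history = [estimate.f1()]

# A hypothetical sequence of corrections made by the annotator.
for correction in ("fn", "fp", "fn"):
    if correction == "fn":
        estimate.correct_false_negative()
    else:
        estimate.correct_false_positive()
    history.append(estimate.f1())

print([round(value, 4) for value in history])
```

The list `history` then gives the estimated $F_1$ after each correction, which is the figure that could be shown to the annotator as she works down the ranked list.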

In the future, we would like to try applying a SATC method after a transductive 
learner (e.g., Transductive SVMs [Joachims 1999]) has been used to generate the base
classifier in place of the standard inductive learners we have used in this work. A 
transductive method, rather than attempting to generate a model that minimizes the 
expected risk on any test set, attempts to minimize misclassifications on a specific test 
set. When the focus of one’s application is squeezing the highest possible accuracy from 
a specific test set, as is the case when using SATC, it would thus make sense to use a 
transductive instead of an inductive learning method. 